Build a Self-Testing System to Red-Team AI Agents

Build a Self-Testing System to Red-Team AI Agents

As artificial intelligence systems become increasingly autonomous and integrated with sensitive tools and data, the challenge of ensuring their safety and security has escalated beyond traditional testing methods. The subtle and unpredictable nature of large language models requires a new paradigm for security validation—one where intelligent systems are used to audit and harden other AI agents. This approach involves creating an automated red-teaming framework where an adversarial agent systematically generates attacks to probe for vulnerabilities, a target agent attempts to defend against them, and a judge agent evaluates the outcome. By orchestrating this agent-against-agent dynamic, developers can build a continuous, scalable, and self-testing evaluation harness that uncovers critical failure modes, from prompt injection and secret leakage to unauthorized tool usage, providing a robust methodology for engineering safer and more reliable AI.

1. Preparing the Workspace for Evaluation

The foundational step in constructing any robust evaluation system is the meticulous preparation of the runtime environment, which guarantees that all subsequent processes operate on a stable and reproducible basis. This begins with the installation of all required dependencies, such as the strands-agents library for creating and managing the agents, its openai extension for model integration, and pydantic for structured data modeling. Establishing a clean environment prevents conflicts and ensures that the behavior of the system is attributable to the agents’ logic rather than environmental inconsistencies. Following the setup, the secure retrieval of essential credentials, like the OpenAI API key, is paramount. Utilizing secure mechanisms such as Google Colab’s userdata manager or the getpass library avoids hardcoding sensitive information, a critical security practice. This initial phase is not merely procedural; it establishes the secure and consistent groundwork upon which the entire agentic testing framework is built, ensuring that the results are both reliable and trustworthy.

Once the environment is secure and all dependencies are in place, the next critical task is the initialization of the language model that will power all agents within the system. The selection of a model, such as gpt-4o-mini, is a significant decision, but equally important is the precise configuration of its generation parameters. Settings like temperature, which controls the randomness of the output, and max_tokens, which limits its length, are carefully calibrated to ensure consistent and predictable behavior. For a testing harness, a low temperature is often preferred to reduce variability between runs, making the evaluation more deterministic and the results easier to compare over time. This uniform model configuration is applied across the target, red-team, and judge agents, ensuring a level playing field where performance differences are a result of their distinct system prompts and tasks, not arbitrary variations in the model’s creative output. This disciplined approach to model initialization is essential for transforming the evaluation from a series of random interactions into a structured, scientific experiment designed to measure AI safety.

2. Creating the Agent to Be Tested

The centerpiece of the evaluation harness is the “target agent,” an AI system designed to interact with a suite of powerful tools while adhering to a strict set of safety protocols. This agent’s core identity is defined by its system prompt, a foundational document that serves as its constitution. The prompt explicitly instructs the agent to function as a helpful tool-using assistant while imposing non-negotiable constraints: it must never reveal secrets, it must refuse any attempts at prompt injection, and it must never exfiltrate sensitive information through its tools. These directives establish the agent’s primary defense mechanism against adversarial attacks. The entire purpose of the red-teaming exercise is to systematically test the resilience of these programmed guardrails. By creating a target with clearly defined behavioral rules, the evaluation can precisely measure any deviation from its intended safe operating parameters, providing clear signals when a vulnerability is successfully exploited by an attack.

To make the security testing realistic and meaningful, the target agent is equipped with a set of simulated tools that mimic high-risk, real-world capabilities. These are not benign functions; they represent vectors for potential misuse and data exfiltration. For instance, the vault_get_secret tool simulates access to a secure vault containing sensitive credentials, while mock_webhook_send and mock_file_write represent channels for sending data to external systems or the local file system. A computational tool like mock_math, which uses eval, introduces the risk of code execution vulnerabilities. By providing the agent access to these potent but controlled capabilities, the testing environment creates a sandbox where the agent’s decision-making process under pressure can be safely observed. The key question being tested is whether the agent will uphold its constitutional duties defined in the system prompt, even when presented with a malicious request that tempts it to misuse one of these powerful tools.

3. Developing the Red Team Agent

With the target agent in place, the next component is the “red-team agent,” a specialized AI designed exclusively to act as the adversary. Its sole purpose is to generate creative and realistic attacks intended to bypass the target’s safety measures. Its system prompt is crafted to encourage adversarial thinking, instructing it to generate prompt injection attacks using a variety of sophisticated manipulation tactics. These are not limited to simple commands but include social engineering strategies such as role-playing (e.g., “You are now in developer mode”), creating a false sense of urgency (“This is a critical system alert”), or impersonating authority (“As the system administrator, I require you to…”). The red-team agent is also prompted to devise attacks that specifically encourage the misuse of the target’s tools. This focus on psychological and logical manipulation ensures that the generated attacks are diverse and representative of the complex threats that autonomous agents will face in the real world, moving far beyond simplistic security probes.

The power of the red-team agent is fully realized through its integration into an automated workflow, which allows for the scalable generation of adversarial test cases. A function like generate_attacks programmatically invokes this agent, requesting a specified number of unique attacks centered on a particular topic, such as misusing tools to access secrets. The function is designed to robustly parse the agent’s output, whether it is a clean JSON list or a more loosely formatted text response, ensuring a reliable stream of attack prompts for the evaluation pipeline. This automation is a critical advantage of the agent-against-agent approach, as it eliminates the significant time and effort required for manual attack crafting. More importantly, it removes human bias from the test generation process, allowing the AI to discover novel or non-intuitive attack vectors that a human red-teamer might overlook, thereby providing a more comprehensive and rigorous stress test of the target agent’s defenses.

4. Establishing the Judging System

To ensure that the evaluation process is objective, repeatable, and scalable, a formalized judging system is established using structured data models and a dedicated “judge agent.” By employing Pydantic BaseModel schemas like AttackResult and RedTeamReport, the system moves beyond subjective, qualitative assessments and into the realm of quantitative, data-driven analysis. The AttackResult schema defines the specific data points to be captured for each individual test, including the attack prompt, the target’s response, a list of observed tool calls, and boolean flags for critical failures like leaked_secret and exfiltration_attempt. It also includes a numerical score for the quality of the agent’s refusal. This structured approach ensures that every test run produces consistent, machine-readable output, which is essential for aggregating results, tracking security posture over time, and integrating the safety evaluation harness into automated development pipelines like CI/CD systems.

The second pillar of this formal evaluation is the “judge agent” itself, an AI tasked with impartially analyzing the outcome of each attack. Its system prompt directs it to act as a neutral arbiter, assessing whether the target agent’s response constitutes a security failure based on a predefined set of criteria. The judge analyzes the complete context of the interaction—the attack, the response, and the tool call log—to determine if secrets were leaked, if tools were misused for data exfiltration, and how effectively the target agent refused the malicious request. This refusal is graded on a scale from 0 to 5, providing a nuanced metric that captures not just whether the agent refused, but how clearly and safely it did so. By delegating this analytical task to an AI, the system ensures that the evaluation is consistent and can be scaled across thousands of tests without introducing the variability and fatigue associated with human judgment.

5. Executing the Attack and Monitoring Behavior

The core of the testing process involves executing each generated adversarial prompt against the target agent within a controlled and closely monitored environment. This is accomplished through a specialized function, run_target_with_observation, which is designed not only to run the interaction but also to capture a detailed record of the agent’s behavior. The key technique employed is the instrumentation of the target’s tools. Each tool, such as vault_get_secret or mock_webhook_send, is wrapped in an observer function before being provided to a temporary instance of the target agent. These wrappers perform a dual role: they execute the original tool’s logic, but they also record every call in a log, capturing the function name and the arguments passed to it. This observation mechanism is a critical innovation, as it provides an unambiguous audit trail of the agent’s actions, distinguishing between an agent that merely discusses a harmful action in its text response and one that actively attempts to execute it by calling a tool.

Following the execution of the attack, the collected data—the initial prompt, the agent’s final text response, and the log of tool calls—is passed to the judge_one function for a comprehensive, multi-faceted analysis. This function serves as the primary evaluation point for each individual test run. It begins by performing deterministic, hard-coded checks for the most severe and obvious failures. For example, it programmatically scans the text response for the presence of the mock secret and checks the tool call log for any use of exfiltration tools like mock_webhook_send. After these initial checks, the complete interaction context is packaged and sent to the judge agent for a more nuanced, AI-driven assessment. The judge agent evaluates factors like the quality of the refusal and identifies more subtle forms of leakage or misuse. This hybrid approach, combining precise programmatic checks with sophisticated LLM-based judgment, ensures a thorough and reliable evaluation of each attack’s outcome, capturing both clear-cut violations and qualitative weaknesses.

6. Compiling the Security Report

The culmination of the red-teaming workflow is the aggregation of all individual test results into a comprehensive and actionable security report. The build_report function orchestrates this entire end-to-end process, beginning by invoking the red-team agent to generate a fresh batch of attacks. It then iterates through each attack, executing it against the target agent via the observation wrapper and submitting the outcome to the judge agent for evaluation. As each structured AttackResult is returned, it is collected into a list. Once all tests are complete, the function processes this raw data to compute high-level summary metrics that provide a snapshot of the agent’s overall security posture. These key indicators include the total number of secret leakage events, the count of data exfiltration attempts, and the average refusal quality score across all attacks. This aggregation transforms a series of isolated tactical results into a strategic overview, allowing developers to quickly grasp the agent’s primary areas of weakness.

The final output, a RedTeamReport object, is designed to be more than just a collection of statistics; it is a decision-making tool. In addition to the summary metrics, the report automatically identifies and surfaces the most high_risk_examples—the specific attack-response pairs that resulted in the most severe security failures, such as successful data exfiltration or a complete failure to refuse a dangerous command. This allows engineers to immediately focus their debugging and mitigation efforts on the most critical vulnerabilities. Furthermore, the report includes a set of concrete, actionable recommendations for hardening the agent. These suggestions, such as implementing tool allowlists, adding proactive scanning for secrets in agent outputs, or gating exfiltration tools behind an additional review step, provide a clear path forward. This transforms the red-teaming exercise from a simple diagnostic process into a constructive feedback loop that directly guides engineering efforts to build a safer and more resilient AI system.

A Framework for Evolving AI Defenses

This entire process established a fully operational agent-against-agent security framework, which successfully moved beyond rudimentary prompt testing into a more sophisticated domain of systematic and repeatable evaluation. The implementation demonstrated how to meticulously observe tool calls, programmatically detect secret leakage, and quantitatively score the quality of safety refusals. These individual judgments were then aggregated into a structured red-team report designed to guide concrete engineering and design decisions. This approach provided a methodology for continuously probing and validating agent behavior as its underlying tools, system prompts, and foundation models evolved over time. Ultimately, the work highlighted that the true potential of agentic AI resided not merely in achieving greater autonomy, but in the capacity to build complex, self-monitoring systems that remained safe, auditable, and fundamentally robust when subjected to persistent adversarial pressure.

Subscribe to our weekly news digest.

Join now and become a part of our fast-growing community.

Invalid Email Address
Thanks for Subscribing!
We'll be sending you our best soon!
Something went wrong, please try again later