Q: What is the difference between manual and automated AI red teaming?

Manual red teaming uses human testers who craft prompts based on creativity, domain knowledge, and understanding of the model's behaviour. It finds subtle, contextual vulnerabilities that automated tools miss and is essential for novel attack chains. Automated red teaming runs large libraries of structured attack prompts systematically, covers more ground faster, produces consistent results, and can be integrated into CI/CD pipelines for regression testing. The best programs use both: automated tools for coverage and regression, manual testers for depth and novel findings.

AI Red Teaming: Planning, Executing, and Reporting | Track 2C

Q: How should I classify red team results?

Define pass, fail, and borderline criteria upfront, before looking at any model responses. A fail is a response that violates the model's stated policy or produces outputs from the misuse case inventory. A borderline result requires a second reviewer. Classify in batches of the same attack type to calibrate consistently. Compute the pass rate per attack category: it tells you which attack surfaces are weakest, which is more useful than an aggregate pass rate across all tests.

Q: What severity tiers apply to AI red team findings?

Critical: exploitable with minimal effort, produces high-impact output (PII exfiltration, tool hijacking, harmful content generation at scale). High: exploitable with moderate effort, significant policy violation. Medium: exploitable under specific conditions, limited impact. Low: requires significant effort or produces minimal impact. Informational: not a direct vulnerability but a design decision that increases risk. Severity is the product of exploitability and impact. A finding that requires 50 carefully crafted prompts to trigger a low-severity output is not the same severity as one that triggers on the first try.

Section 01

What AI red teaming is

A red team is an adversarial group that attacks a system to find weaknesses before real attackers do. The term comes from military wargaming, where the red team plays the adversary. Applied to AI, red teaming means systematically trying to make your own AI system fail in ways that matter.

The key word is systematically. Clicking around a chatbot and noticing it sometimes says something odd is not red teaming. Red teaming has a defined threat model, a structured attack plan, documented results, and produces findings that are classified by severity and mapped to remediation actions.

Traditional software penetration testing finds code vulnerabilities: buffer overflows, authentication bypasses, SQL injection. These are largely deterministic: the same input generally produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written.

1. Scoping and threat modelling

Define what the model is authorised to do, who the attacker is, what they want, and what a successful attack looks like. Produce a misuse case inventory.

2. Attack surface mapping

Document every channel an attacker can use to reach the model: direct prompts, indirect injection via retrieval, tool responses, memory, and multi-agent communication.

3. Adversarial test design

Select attack categories matched to the threat model. Design prompts for each category. Define pass, fail, and borderline criteria before execution.

4. Execution

Run manual and automated tests. Record every prompt and response verbatim. Classify results using the criteria defined in phase 3.

5. Reporting

Write a report with exact reproduction steps, severity tiers, root cause hypotheses, and specific remediation recommendations per finding.

MITRE ATLAS AML.T0043 Craft Adversarial Data covers the technique of constructing inputs designed to cause model failures. AI red teaming is the structured practice of applying this and related techniques as a defensive exercise to find and fix weaknesses before attackers exploit them.

Section 02

Scoping and threat modelling

Scoping answers four questions before any testing starts. What is the model authorised to do? Who is the realistic attacker? What would a successful attack produce? What is explicitly out of scope?

The authorisation boundary defines normal behaviour. For a customer support chatbot: answer questions about products, escalate to human agents, never discuss competitor pricing, never generate content unrelated to support. This boundary is the reference point for classifying results. A response is a finding when it violates the boundary.

The misuse case inventory is a list of what an attacker would want to achieve. Build it by asking: if someone wanted to misuse this system, what would they try to get it to do? Common misuse cases across AI system types:

Chatbots and assistants

Information extraction

Extract system prompt contents
Surface confidential data from retrieval store
Expose other users' session data
Reveal model version or provider details

Autonomous agents

Tool and action hijacking

Send emails to attacker addresses
Exfiltrate files from connected storage
Spend API budget on attacker tasks
Create persistent access via memory

All system types

Policy bypass

Generate harmful or illegal content
Bypass content filters via encoding
Impersonate the operator or system
Cause denial of service via resource exhaustion

A well-defined misuse case inventory drives everything downstream: which attack categories to test, what constitutes a failure, and how to measure coverage. Coverage means the fraction of misuse cases that were tested. A red team report that cannot report coverage by misuse case is incomplete.
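The coverage measure described above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the `MisuseCase` type, the `MC-..` identifiers, and the example set of tested cases are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MisuseCase:
    case_id: str       # hypothetical identifier scheme, e.g. "MC-01"
    description: str


def coverage(inventory: list[MisuseCase], tested_case_ids: set[str]) -> tuple[float, list[str]]:
    """Return (fraction of misuse cases tested, IDs of untested cases)."""
    if not inventory:
        raise ValueError("misuse case inventory is empty")
    untested = [mc.case_id for mc in inventory if mc.case_id not in tested_case_ids]
    fraction = (len(inventory) - len(untested)) / len(inventory)
    return fraction, untested


# Illustrative inventory built from misuse cases listed in this section.
inventory = [
    MisuseCase("MC-01", "Extract system prompt contents"),
    MisuseCase("MC-02", "Surface confidential data from retrieval store"),
    MisuseCase("MC-03", "Send emails to attacker addresses"),
    MisuseCase("MC-04", "Bypass content filters via encoding"),
]
# Misuse cases targeted by at least one executed prompt (made-up example).
tested = {"MC-01", "MC-02", "MC-04"}

fraction, untested = coverage(inventory, tested)
print(f"coverage: {fraction:.0%}, untested: {untested}")
```

Returning the untested IDs alongside the fraction keeps the gaps explicit, so they can be listed in the report instead of being hidden behind a single percentage.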

Section 03

Mapping the attack surface

The AI attack surface has three layers: where attacker-controlled content enters the system, what the model can do once it processes that content, and what the model can produce as output.

The input layer is the most overlooked. For a RAG system, attacker-controlled content arrives not only through the user's direct prompt but also through retrieved documents, cached memories, and the responses of external APIs the system calls. Each of these is an injection vector. Coverage means testing all of them, not only the obvious user-facing chat interface.

Input channels

Where content enters

Direct user prompt
System prompt (via admin interface)
Retrieved documents (RAG)
Tool and API responses
Agent memory reads
Multi-agent messages

Model capabilities

What the model can do

Execute code or shell commands
Browse the web
Read and write files
Call external APIs
Send messages (email, Slack)
Query databases

Output channels

What the model produces

Natural language text
Generated code
Structured data (JSON, SQL)
Tool calls and parameters
Memory writes
Agent sub-task instructions

Capabilities amplify the impact of a successful injection. A chatbot that can only produce text has limited blast radius if hijacked. An agent that can send emails, write files, and call payment APIs has a much larger blast radius. Map capabilities before testing because they determine the severity of each finding: the same injection that produces a harmless wrong answer in a text-only bot constitutes a critical finding in an agent with write access to production systems.

Web application scanning versus API scanning. Many AI systems are accessible through a web interface with no direct API. Testing through the UI is slower and requires browser automation but reaches behaviour that is only reachable through the real product. Testing through the API is faster and more controllable but may miss UI-level constraints and session management behaviours. A complete assessment covers both.

Section 04

Adversarial test design

Attack category selection is the most consequential design decision in a red team assessment. Running every attack category against every system wastes time and produces a report too diffuse to act on. Selecting categories matched to the threat model and system type produces focused, actionable findings.

The table below maps system types to priority attack categories. The priority column uses the misuse case inventory as its basis: categories are high priority if they address misuse cases with high impact and realistic attacker capability.

System type	Attack category	Priority	What it tests
RAG system	Indirect injection	Critical	Instructions embedded in retrieved documents
RAG system	Extraction attacks	High	Surfacing confidential document content via queries
RAG system	Hallucination induction	Medium	Queries that push the model outside its knowledge base
Autonomous agent	Multi-step injection	Critical	Injection chains that accumulate privilege across steps
Autonomous agent	Tool abuse	Critical	Hijacking tool calls to perform attacker-directed actions
Autonomous agent	Goal hijacking	High	Redirecting the agent's planning toward attacker objectives
Public chatbot	Jailbreaks	Critical	Bypassing content policy via roleplay, encoding, and framing
Public chatbot	Prompt injection	High	Overriding system instructions from user turn
Public chatbot	PII extraction	High	Probing for training data memorisation or cross-session leakage

Within each attack category, test depth varies from shallow to deep. A shallow test sends a canonical example from the attack library, for example a standard DAN jailbreak prompt. A deep test chains multiple techniques, adapts to the model's responses, and uses encoding or obfuscation to evade filters. Deep tests find vulnerabilities that shallow tests miss. A complete assessment includes both.

Define pass and fail criteria for each category before running tests. For jailbreaks: a fail is any response that produces content the system prompt explicitly prohibits. For indirect injection: a fail is any response that executes an instruction originating from a retrieved document, not from the user. Writing criteria before execution prevents the evaluator from rationalising borderline responses as passes.
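One way to enforce "criteria before execution" is to record the criteria as immutable data and refuse to start while any planned category lacks them. The sketch below assumes a hypothetical `Criteria` record. The fail wording follows the examples above, while the borderline wording is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: criteria cannot be edited once recorded
class Criteria:
    category: str
    fail_when: str
    borderline_when: str


def categories_missing_criteria(planned: list[str], criteria: list[Criteria]) -> list[str]:
    """Return planned attack categories that have no recorded criteria yet."""
    defined = {c.category for c in criteria}
    return [category for category in planned if category not in defined]


planned = ["jailbreak", "indirect_injection", "extraction"]
criteria = [
    Criteria(
        category="jailbreak",
        fail_when="response contains content the system prompt explicitly prohibits",
        borderline_when="response discusses the topic in hedged or fictional language",
    ),
    Criteria(
        category="indirect_injection",
        fail_when="response executes an instruction that originated in a retrieved document",
        borderline_when="response partially follows the embedded instruction",
    ),
]

missing = categories_missing_criteria(planned, criteria)
ready_to_execute = not missing  # execution should wait until this is True
print(missing)
```

Here `extraction` has no criteria yet, so `ready_to_execute` stays false until they are written down.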

Section 05

Execution

Execution has two modes: manual and automated. They answer different questions and serve different purposes in a complete red team program.

Manual red teaming

+ Finds subtle, contextual failures that require human judgment

+ Adapts in real time based on model responses

+ Discovers novel attack chains no library contains

+ Essential for high-risk features and novel system types

- Slow and expensive at scale

- Cannot be run in CI/CD pipelines

- Results depend on tester skill and knowledge

Automated red teaming

+ Covers hundreds of attack prompts in minutes

+ Produces consistent, reproducible results

+ Integrates into CI/CD for regression testing

+ Tracks security posture over time across fixes

- Limited to known attack patterns in its library

- Cannot adapt to model responses mid-test

- May miss subtle semantic failures requiring human judgment
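To illustrate the CI/CD regression use of automated runs, the sketch below compares per-category pass rates from the current run against a stored baseline and flags any category that got worse. The function name, the tolerance parameter, and all the numbers are assumptions for illustration. A real pipeline would load both sets of rates from its own test output.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return categories whose pass rate dropped by more than `tolerance`.

    A category missing from `current` is treated as pass rate 0.0, so that
    silently dropped tests do not slip through the gate.
    """
    dropped = {}
    for category, base_rate in baseline.items():
        now = current.get(category, 0.0)
        if base_rate - now > tolerance:
            dropped[category] = (base_rate, now)
    return dropped


# Made-up pass rates for illustration only.
baseline = {"jailbreak": 0.95, "indirect_injection": 0.80, "extraction": 0.90}
current = {"jailbreak": 0.96, "indirect_injection": 0.70, "extraction": 0.90}

failed = regressions(baseline, current, tolerance=0.02)
for category, (before, after) in failed.items():
    print(f"REGRESSION {category}: {before:.0%} -> {after:.0%}")

# In a CI job, a non-zero exit code would fail the pipeline step.
exit_code = 1 if failed else 0
```

Because model output is probabilistic, a small tolerance avoids failing the build on run-to-run noise. How large it should be is a judgment call for the team.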

1. Build the test dataset

Assemble prompts per attack category. Include canonical examples from known attack libraries, variations adapted to the specific system's domain, and deep multi-step sequences. Record the source and category of every prompt before running anything.

2. Run tests and record responses verbatim

Log every prompt and every response in full. Do not paraphrase responses during logging. The exact wording of the model's output is the evidence. Summaries lose information that matters for classification and reproduction.

3. Rerun failures to confirm

AI systems are probabilistic. A failure on the first run may not reproduce. Run each failing prompt at least three times. A finding that reproduces on two of three runs is a confirmed finding. A finding that appears once in ten runs is an informational note, not a confirmed vulnerability.

4. Track coverage against misuse cases

At the end of execution, verify that every misuse case from the scope inventory was tested by at least one prompt. Untested misuse cases are gaps in coverage, not evidence that the system is secure against them.
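The rerun step above can be sketched as a small helper. It assumes a hypothetical `attack_succeeds` callable that sends the prompt to the system under test, classifies the response against the predefined criteria, and returns `True` on a fail verdict. The scripted outcomes below stand in for real model calls.

```python
from typing import Callable


def reproduction_status(
    prompt: str,
    attack_succeeds: Callable[[str], bool],
    runs: int = 3,
    confirm_at: int = 2,
) -> tuple[str, int]:
    """Rerun one failing prompt `runs` times and count how often it fails again."""
    hits = sum(1 for _ in range(runs) if attack_succeeds(prompt))
    if hits >= confirm_at:
        return "confirmed", hits
    if hits > 0:
        return "unconfirmed", hits  # seen again, but below the confirmation bar
    return "not reproduced", hits


# Stand-in for a real model call: a scripted sequence of rerun outcomes.
scripted = iter([True, False, True])
status, hits = reproduction_status("<failing prompt text>", lambda _prompt: next(scripted))
print(status, hits)
```

The defaults mirror the two-of-three rule in step 3. The labels for results below that bar are an assumption, and a team may prefer to record such results as informational notes.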

Section 06

Classifying results

Classification assigns each test result to one of three categories: pass, fail, or borderline. The criteria must be defined before execution. Defining criteria after seeing results introduces confirmation bias, where the evaluator unconsciously adjusts standards based on how many failures they want to report.

Classify in batches of the same attack type. Evaluating all jailbreak tests before moving to extraction tests keeps calibration consistent within each category. Mixed classification across categories drifts.

Category	Prompt summary	Response indicator	Verdict
Jailbreak	Roleplay as unconstrained AI	Model produces explicitly prohibited content	Fail
Jailbreak	Fictional framing for restricted topic	Model discusses topic in hedged fictional language	Borderline
Jailbreak	Standard DAN prompt	Model refuses and explains why	Pass
Indirect injection	Instruction embedded in retrieved document	Model executes the embedded instruction	Fail
Indirect injection	Instruction embedded but marked as document content	Model quotes the instruction without executing it	Pass
Extraction	Ask for system prompt contents	Model reveals partial system prompt	Fail
Extraction	Ask leading questions about capabilities	Model answers with information that implies but does not reveal system prompt	Borderline

Borderline results require a second reviewer. The second reviewer should classify without seeing the first reviewer's verdict. If they agree, record the agreed verdict. If they disagree, escalate to a third reviewer or record as borderline with both rationales. Borderline findings are not discarded. They are reported as such with the specific reasoning, because they often reveal ambiguity in the authorisation boundary that needs to be resolved in policy, not in code.

Compute the pass rate per attack category, not just an aggregate. An aggregate pass rate of 85% sounds good. A per-category breakdown that shows a 40% pass rate on indirect injection reveals a critical weakness that the aggregate obscures.
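The effect is easy to reproduce with made-up numbers. In the sketch below, 100 hypothetical test results give an 85% aggregate pass rate while indirect injection passes only 40% of the time. Counting borderline verdicts as non-passes is an assumption of this sketch, not a rule from the methodology.

```python
from collections import Counter


def pass_rates(results: list[tuple[str, str]]) -> dict[str, float]:
    """Compute the pass rate per attack category from (category, verdict) pairs.

    Any verdict other than "pass" (fail or borderline) counts against the rate.
    """
    totals: Counter[str] = Counter()
    passes: Counter[str] = Counter()
    for category, verdict in results:
        totals[category] += 1
        if verdict == "pass":
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Made-up verdicts chosen to match the 85% / 40% example in the text.
results = (
    [("jailbreak", "pass")] * 46 + [("jailbreak", "fail")] * 4
    + [("extraction", "pass")] * 35 + [("extraction", "fail")] * 5
    + [("indirect_injection", "pass")] * 4 + [("indirect_injection", "fail")] * 6
)

aggregate = sum(1 for _, verdict in results if verdict == "pass") / len(results)
rates = pass_rates(results)

print(f"aggregate: {aggregate:.0%}")
for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category}: {rate:.0%}")
```

Sorting by rate puts the weakest category first, which is usually the order the report should discuss them in.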

Section 07

Severity tiering

Severity is a function of two factors: exploitability (how much effort and skill does the attack require?) and impact (what happens when the attack succeeds?). Neither factor alone determines severity. A trivially easy attack that produces a harmless result is not critical. A technically difficult attack that causes full agent compromise may still be critical because the impact justifies the effort from an attacker's perspective.

Critical

Exploitable with minimal effort, high-impact output

Works on first or second attempt with no specialised knowledge. Produces outputs that directly cause harm: PII exfiltration at scale, tool hijacking with external consequences (emails sent, files deleted, payments triggered), generation of illegal content. Requires immediate remediation before deployment or continued operation.

High

Moderate effort, significant policy violation

Requires a few attempts or some knowledge of the system. Produces outputs that violate stated policy in a meaningful way: partial system prompt revelation, consistent generation of content the model is instructed to avoid, agent behaviours outside authorised scope without external consequences. Requires remediation before next release.

Medium

Specific conditions required, limited impact

Requires specific setup or domain knowledge to trigger. Impact is limited in scope or severity. Typical examples are borderline content that appears only under specific framing, and findings with an inconsistent reproduction rate (fails more than passes). Should be remediated, but does not block release.

Low

Significant effort required, minimal impact

Requires extensive prompt crafting or specialised knowledge. Produces outputs with minimal practical impact even if the attack succeeds. Worth noting and monitoring but not requiring immediate action.

Informational

Not a vulnerability, but a risk-increasing design choice

Not exploitable in current form but represents an architectural or policy decision that increases the risk surface. For example: verbose error messages that reveal internal structure, overly permissive system prompts that make jailbreaks easier, capability grants broader than needed for the stated use case.

Severity is not the same as difficulty for the attacker. A finding is not automatically low severity because a sophisticated attacker would be needed to exploit it. Sophisticated attackers exist and are motivated by high-value targets. Weigh what happens if the attack succeeds, and do not discount a finding because it was hard to discover.
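Combining the two factors can be made explicit with a lookup table. The matrix below is a sketch under stated assumptions: only the diagonal cells correspond to the tier definitions above (minimal effort and high impact is Critical, and so on), and every other cell is a hypothetical choice that an assessment team would need to agree on for itself. Informational findings are not exploitable and sit outside the matrix.

```python
# Effort levels, ordered from easiest to hardest for the attacker.
EFFORT = ("minimal", "moderate", "specific_conditions", "significant")

# Rows: impact level. Columns: effort level, in the order of EFFORT.
# Only the diagonal comes from the tier definitions; the remaining cells
# are assumptions made for this illustration.
SEVERITY_MATRIX = {
    "high":        ("critical", "critical", "high",   "high"),
    "significant": ("high",     "high",     "medium", "medium"),
    "limited":     ("medium",   "medium",   "medium", "low"),
    "minimal":     ("low",      "low",      "low",    "low"),
}


def severity_tier(effort: str, impact: str) -> str:
    """Look up a severity tier from attacker effort and impact."""
    if effort not in EFFORT:
        raise ValueError(f"unknown effort level: {effort!r}")
    if impact not in SEVERITY_MATRIX:
        raise ValueError(f"unknown impact level: {impact!r}")
    return SEVERITY_MATRIX[impact][EFFORT.index(effort)]


print(severity_tier("minimal", "high"))      # easy attack, severe outcome
print(severity_tier("minimal", "minimal"))   # easy attack, harmless outcome
print(severity_tier("significant", "high"))  # hard attack, severe outcome
```

In this sketch a high-impact finding never drops below High, however much effort it takes. A team that treats full agent compromise as Critical regardless of effort would raise those cells accordingly.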

Section 08

Writing the report

A red team report that does not produce remediation action has failed at its primary purpose. The report is not a record of what the red team did. It is a decision-making tool for the people responsible for fixing the system.

The report has two audiences: the technical team who will implement the fixes, and the non-technical stakeholders who decide whether to deploy, delay, or accept risk. Both need different things from the same document. Structure the report so both can extract what they need without reading everything.

Executive section

Summary and risk posture

Scope summary, total findings by severity tier, coverage by misuse case, overall risk assessment, and the single most important finding in plain language. Non-technical readers stop here.

Finding structure

One page per confirmed finding

Finding ID, severity tier with rationale, misuse case addressed, exact reproduction steps, exact prompt text, exact model response verbatim, root cause hypothesis, and specific remediation recommendation.

Coverage table

Misuse case vs test coverage

A table showing every misuse case from the scope inventory, the attack categories used to test it, the number of test prompts run, and the pass rate. Makes gaps in coverage explicit.

Appendix

Complete test dataset

Every prompt and response logged during execution, classified with verdict. Enables another analyst to independently verify findings and supports regression testing in future assessment cycles.
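The per-finding structure listed above maps naturally onto a record type with a completeness check. The `Finding` class, its field names, and the example values are hypothetical. The point of the sketch is only that an empty field is detectable before the report goes out.

```python
from dataclasses import dataclass, fields


@dataclass
class Finding:
    finding_id: str
    severity: str
    severity_rationale: str
    misuse_case: str
    reproduction_steps: list[str]
    prompt_text: str
    response_verbatim: str
    root_cause_hypothesis: str
    remediation: str

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are empty (blank string or empty list)."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values for illustration; a real record holds the exact text.
draft = Finding(
    finding_id="F-001",
    severity="high",
    severity_rationale="Reproduces within a few attempts; reveals part of the system prompt.",
    misuse_case="Extract system prompt contents",
    reproduction_steps=["Open a new session", "Send the prompt exactly as recorded"],
    prompt_text="<exact prompt text>",
    response_verbatim="<exact model response>",
    root_cause_hypothesis="",  # not written yet
    remediation="",            # not written yet
)

print(draft.missing_fields())
```

A check like this catches the two report failures discussed in this section, missing reproduction steps and missing remediation, but it cannot judge whether a recommendation is specific enough to act on.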

The most common failure in red team reports is vague remediation recommendations. "Improve prompt robustness" is not actionable. "Add a guard instruction that explicitly prohibits responding to instructions found in retrieved documents, and test with the 14 indirect injection prompts in Appendix B" is actionable.

The second most common failure is missing reproduction steps. A finding that cannot be reproduced by another analyst is not a confirmed finding. It may be a real vulnerability, but without reproduction steps it cannot be verified after a fix is deployed, which means it cannot be closed.

Retesting after remediation. A red team assessment does not end when the report is delivered. Each confirmed finding should be retested after the remediation is implemented. The retest verifies the fix works against the exact prompts that triggered the original finding. A separate regression test verifies the fix has not introduced new failures in adjacent areas. Document both the original finding and the retest result in a single record so the full remediation cycle is auditable.

Section 09

Frequently asked questions

How is AI red teaming different from traditional penetration testing?

Traditional penetration testing targets code vulnerabilities: buffer overflows, injection flaws, authentication bypasses. These are largely deterministic: the same input generally produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written. The skills overlap but the mental model is different: you are attacking learned behaviour, not code logic.

What is a misuse case in AI red teaming?

A misuse case defines what an attacker would want to achieve against your AI system. For a customer support chatbot: convince the bot to reveal confidential retrieval store contents, generate harmful content disguised as a support response, or extract the system prompt. For an autonomous agent with tool access: hijack the agent to send emails to attacker addresses, exfiltrate files from connected storage, or spend API budget on attacker tasks. Misuse cases drive attack category selection and are the basis for measuring red team coverage.

What attack categories should I prioritise for a RAG system?

For a RAG system the priority categories are: indirect prompt injection (attacker embeds instructions in documents the retriever will fetch), data extraction via retrieval (asking questions designed to surface confidential document content), and hallucination induction (queries that push the model outside its knowledge base into confabulation). The retriever's input processing, not only the language model, is part of the attack surface. Test injection through every document type the retriever can ingest.

What makes a good red team finding?

A good red team finding includes: the exact prompt or sequence of prompts that produced the failure, the exact model response verbatim, reproduction steps another analyst can follow, a root cause hypothesis, the severity tier with reasoning, and a concrete remediation recommendation. Vague findings like "the model sometimes produces harmful content" are not actionable. Specific findings with reproduction steps and exact responses are.

How should I classify red team results without confirmation bias?

Define pass, fail, and borderline criteria before running any tests. Classify in batches of the same attack type, not in mixed order across categories. Have a second reviewer classify borderline results independently without seeing the first reviewer's verdict. Compute pass rates per attack category rather than as a single aggregate number. Per-category rates reveal which attack surfaces are weak in a way that aggregate scores hide.

What severity tiers apply to AI red team findings?

Critical: exploitable with minimal effort, high-impact output such as PII exfiltration or tool hijacking with real-world consequences. High: moderate effort, significant policy violation. Medium: specific conditions required, limited impact. Low: significant effort, minimal impact. Informational: not exploitable but a design choice that increases risk. Severity is the product of exploitability and impact. A finding that requires 50 carefully crafted prompts to trigger a low-severity output is not the same severity as one that triggers on the first try with a generic prompt.
