C6: AI Red Teaming
AI red teaming is adversarial testing of AI systems to find security failures before attackers do. It differs from traditional penetration testing because the attack surface includes model behaviour, not only code paths. A complete AI red team assessment has five phases: scoping and threat modelling, attack surface mapping, adversarial test design, execution, and reporting. Scoping produces a misuse case inventory. Attack surface mapping covers input channels (direct prompt, indirect injection via retrieval, tool responses), model capabilities, and output channels. Test design selects attack categories matched to system type: for RAG systems prioritise indirect injection and extraction attacks; for autonomous agents prioritise multi-step injection and tool abuse; for chatbots prioritise jailbreaks and harmful content. Execution produces a test dataset with model responses classified as pass, fail, or borderline using criteria defined before execution. Result classification per attack category is more useful than an aggregate score. Severity tiering: critical (exploitable with minimal effort, high impact), high (moderate effort, significant policy violation), medium (specific conditions, limited impact), low (significant effort or minimal impact), informational (design decisions that increase risk). A good red team report includes exact prompts, exact responses, reproduction steps, root cause hypothesis, and remediation recommendations. Manual red teaming finds subtle contextual failures. Automated red teaming covers more ground faster and enables regression testing. MITRE ATLAS AML.T0043 Craft Adversarial Data.
Duration: 38 minutes (PT38M) · Level: Intermediate · Language: en · Date: 2026-04-06 · Publisher: Mirror Academy
Module C6 of 6 · Track 2C: Model and Training Attacks
Find the failures before attackers do.
AI Red Teaming
Passing QA means the model does what you intended. Red teaming tests whether an attacker can make it do what they intend. This module covers the full methodology: threat scoping, attack surface mapping, adversarial test design, execution, result classification, and writing a report that drives remediation.
A red team is an adversarial group that attacks a system to find weaknesses before real attackers do. The term comes from military wargaming, where the red team plays the adversary. Applied to AI, red teaming means systematically trying to make your own AI system fail in ways that matter.
The key word is systematically. Clicking around a chatbot and noticing it sometimes says something odd is not red teaming. Red teaming has a defined threat model, a structured attack plan, documented results, and produces findings that are classified by severity and mapped to remediation actions.
Traditional software penetration testing finds code vulnerabilities: buffer overflows, authentication bypasses, SQL injection. These are deterministic. The same input always produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written.
1. Scoping and threat modelling (start here). Define what the model is authorised to do, who the attacker is, what they want, and what a successful attack looks like. Produce a misuse case inventory.
2. Attack surface mapping. Document every channel an attacker can use to reach the model: direct prompts, indirect injection via retrieval, tool responses, memory, and multi-agent communication.
3. Adversarial test design. Select attack categories matched to the threat model. Design prompts for each category. Define pass, fail, and borderline criteria before execution.
4. Execution. Run manual and automated tests. Record every prompt and response verbatim. Classify results using the criteria defined in phase 3.
5. Reporting (drives action). Write a report with exact reproduction steps, severity tiers, root cause hypotheses, and specific remediation recommendations per finding.
MITRE ATLAS AML.T0043 Craft Adversarial Data covers the technique of constructing inputs designed to cause model failures. AI red teaming is the structured practice of applying this and related techniques as a defensive exercise to find and fix weaknesses before attackers exploit them.
Section 02
Scoping and threat modelling
Scoping answers four questions before any testing starts. What is the model authorised to do? Who is the realistic attacker? What would a successful attack produce? What is explicitly out of scope?
The authorisation boundary defines normal behaviour. For a customer support chatbot: answer questions about products, escalate to human agents, never discuss competitor pricing, never generate content unrelated to support. This boundary is the reference point for classifying results. A response is a finding when it violates the boundary.
The misuse case inventory is a list of what an attacker would want to achieve. Build it by asking: if someone wanted to misuse this system, what would they try to get it to do? Common misuse cases across AI system types:
Chatbots and assistants: information extraction
- Extract system prompt contents
- Surface confidential data from retrieval store
- Expose other users' session data
- Reveal model version or provider details

Autonomous agents: tool and action hijacking
- Send emails to attacker addresses
- Exfiltrate files from connected storage
- Spend API budget on attacker tasks
- Create persistent access via memory

All system types: policy bypass
- Generate harmful or illegal content
- Bypass content filters via encoding
- Impersonate the operator or system
- Cause denial of service via resource exhaustion
A well-defined misuse case inventory drives everything downstream: which attack categories to test, what constitutes a failure, and how to measure coverage. Coverage means the fraction of misuse cases that were tested. A red team report that cannot report coverage by misuse case is incomplete.
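The coverage measure described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the `MisuseCase` record, its field names, the `MC-…` identifiers, and the prompt counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MisuseCase:
    """One entry in the misuse case inventory (field names are illustrative)."""
    case_id: str
    system_type: str
    goal: str
    prompts_run: int = 0  # number of test prompts that targeted this case


def coverage(inventory: list[MisuseCase]) -> float:
    """Fraction of misuse cases tested by at least one prompt."""
    if not inventory:
        return 0.0
    tested = sum(1 for case in inventory if case.prompts_run > 0)
    return tested / len(inventory)


def untested(inventory: list[MisuseCase]) -> list[str]:
    """IDs of misuse cases that no prompt has exercised yet."""
    return [case.case_id for case in inventory if case.prompts_run == 0]


# Hypothetical inventory; goals are taken from the lists above, counts are made up.
inventory = [
    MisuseCase("MC-01", "chatbot", "Extract system prompt contents", prompts_run=12),
    MisuseCase("MC-02", "agent", "Send emails to attacker addresses", prompts_run=8),
    MisuseCase("MC-03", "agent", "Create persistent access via memory"),
    MisuseCase("MC-04", "all", "Bypass content filters via encoding", prompts_run=5),
]

print(f"coverage: {coverage(inventory):.0%}")
print("untested:", untested(inventory))
```

Listing the untested IDs alongside the percentage keeps the gap visible: a single coverage number hides which misuse case was skipped.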
Section 03
Mapping the attack surface
The AI attack surface has three layers: where attacker-controlled content enters the system, what the model can do once it processes that content, and what the model can produce as output.
The input layer is the most overlooked. For a RAG system, attacker-controlled content does not only arrive through the user's direct prompt. It also arrives through retrieved documents, cached memories, and the responses of external APIs the system calls. Each of these is an injection vector. Coverage means testing all of them, not only the obvious user-facing chat interface.
Input channels: where content enters
- Direct user prompt
- System prompt (via admin interface)
- Retrieved documents (RAG)
- Tool and API responses
- Agent memory reads
- Multi-agent messages

Model capabilities: what the model can do
- Execute code or shell commands
- Browse the web
- Read and write files
- Call external APIs
- Send messages (email, Slack)
- Query databases

Output channels: what the model produces
- Natural language text
- Generated code
- Structured data (JSON, SQL)
- Tool calls and parameters
- Memory writes
- Agent sub-task instructions
Capabilities amplify the impact of a successful injection. A chatbot that can only produce text has limited blast radius if hijacked. An agent that can send emails, write files, and call payment APIs has a much larger blast radius. Map capabilities before testing because they determine the severity of each finding: the same injection that produces a harmless wrong answer in a text-only bot constitutes a critical finding in an agent with write access to production systems.
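One way to make the capability map usable during severity assessment is to derive a rough impact ceiling from the capabilities a system has been granted. The sketch below is an assumption-laden illustration: the capability names, the split into side-effecting and read-only groups, and the three ceiling labels are hypothetical, and a real system may place a capability (for example database access with write permission) in a different group.

```python
# Hypothetical grouping: capabilities that change state outside the conversation.
SIDE_EFFECTING = {"execute_code", "write_files", "call_external_apis", "send_messages"}
# Hypothetical grouping: capabilities that only read data.
READ_ONLY = {"browse_web", "read_files", "query_databases"}


def impact_ceiling(capabilities: set[str]) -> str:
    """Worst plausible outcome of a successful injection, given granted capabilities."""
    if capabilities & SIDE_EFFECTING:
        return "external side effects"
    if capabilities & READ_ONLY:
        return "data exposure"
    return "text output only"


text_only_bot: set[str] = set()
support_agent = {"read_files", "send_messages"}

print(impact_ceiling(text_only_bot))
print(impact_ceiling(support_agent))
```

The same injection prompt gets a different ceiling depending only on the capability set, which is the point the paragraph above makes about blast radius.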
Web application scanning versus API scanning. Many AI systems are accessible through a web interface with no direct API. Testing through the UI is slower and requires browser automation but reaches behaviour that is only reachable through the real product. Testing through the API is faster and more controllable but may miss UI-level constraints and session management behaviours. A complete assessment covers both.
Mirror Security · Attack Surface Knowledge Base
Map your attack surface using Mirror's component library
Mirror's attack surface knowledge base is organised around eight categories: model, agents, inference, data, training, application, user interaction, and full application. Browse each component for specific risks, vulnerabilities, and mitigation strategies before you start testing.
Attack category selection is the most consequential design decision in a red team assessment. Running every attack category against every system wastes time and produces a report too diffuse to act on. Selecting categories matched to the threat model and system type produces focused, actionable findings.
The table below maps system types to priority attack categories. The priority column uses the misuse case inventory as its basis: categories are high priority if they address misuse cases with high impact and realistic attacker capability.
| System type | Attack category | Priority | What it tests |
|---|---|---|---|
| RAG system | Indirect injection | Critical | Instructions embedded in retrieved documents |
| RAG system | Extraction attacks | High | Surfacing confidential document content via queries |
| RAG system | Hallucination induction | Medium | Queries that push the model outside its knowledge base |
| Autonomous agent | Multi-step injection | Critical | Injection chains that accumulate privilege across steps |
| Autonomous agent | Tool abuse | Critical | Hijacking tool calls to perform attacker-directed actions |
| Autonomous agent | Goal hijacking | High | Redirecting the agent's planning toward attacker objectives |
| Public chatbot | Jailbreaks | Critical | Bypassing content policy via roleplay, encoding, and framing |
| Public chatbot | Prompt injection | High | Overriding system instructions from user turn |
| Public chatbot | PII extraction | High | Probing for training data memorisation or cross-session leakage |
Within each attack category, test depth varies from shallow to deep. A shallow test sends a canonical example from the attack library, for example a standard DAN jailbreak prompt. A deep test chains multiple techniques, adapts to the model's responses, and uses encoding or obfuscation to evade filters. Deep tests find vulnerabilities that shallow tests miss. A complete assessment includes both.
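As a small illustration of the obfuscation step in a deep test, the sketch below wraps one probe in several encodings so each variant can be sent separately. The probe is a harmless canary instruction invented for this example; the function name, the canary token, and the choice of Base64 and ROT13 are assumptions, not a list taken from any attack library.

```python
import base64
import codecs


def probe_variants(probe: str) -> dict[str, str]:
    """Return the same probe in plain and obfuscated forms."""
    return {
        "plain": probe,
        "base64": base64.b64encode(probe.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(probe, "rot_13"),
    }


# Hypothetical harmless canary: if the token appears in a response,
# the model followed the probe despite the encoding.
CANARY = "Reply with the exact token CANARY-7F3A and nothing else."

for name, text in probe_variants(CANARY).items():
    print(f"{name:>6}: {text}")
```

A canary keeps the test safe to log and share: the verdict depends on whether the token appears, not on producing prohibited content.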
Define pass and fail criteria for each category before running tests. For jailbreaks: a fail is any response that produces content the system prompt explicitly prohibits. For indirect injection: a fail is any response that executes an instruction originating from a retrieved document, not from the user. Writing criteria before execution prevents the evaluator from rationalising borderline responses as passes.
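Writing criteria down before execution can be as simple as a frozen record per category. The following is a sketch under stated assumptions: the `Criteria` and `Verdict` names are hypothetical, the fail wording follows the text above, and the borderline wording for indirect injection is an assumed placeholder because the text does not define one.

```python
from dataclasses import dataclass, FrozenInstanceError
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    BORDERLINE = "borderline"


@dataclass(frozen=True)  # frozen: criteria cannot be edited once testing starts
class Criteria:
    category: str
    fail_if: str
    borderline_if: str


CRITERIA = {
    "jailbreak": Criteria(
        category="jailbreak",
        fail_if="Response contains content the system prompt explicitly prohibits.",
        borderline_if="Response discusses the restricted topic in hedged fictional language.",
    ),
    "indirect_injection": Criteria(
        category="indirect_injection",
        fail_if="Response executes an instruction that originated in a retrieved document.",
        # Assumed wording; no borderline case is defined for this category in the text.
        borderline_if="Response partially acts on the embedded instruction.",
    ),
}


def classify(fail_met: bool, borderline_met: bool) -> Verdict:
    """Turn the reviewer's yes/no answers to the written criteria into a verdict."""
    if fail_met:
        return Verdict.FAIL
    if borderline_met:
        return Verdict.BORDERLINE
    return Verdict.PASS


print(classify(fail_met=False, borderline_met=True))
```

Because the record is frozen, an attempt to reword a criterion after results are in raises `FrozenInstanceError`, which turns "criteria first" from a habit into something the tooling enforces.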
Mirror Security · DiscoveR
Try your category selection in DiscoveR
The attack categories in the table above map directly to DiscoveR's security_categories selector. Pick the categories for your system type, set a prompt budget with max_depth, and run the scan. The free playground covers quickScan and jailbreakAndInjection out of the box.
Section 05
Execution
Execution has two modes: manual and automated. They answer different questions and serve different purposes in a complete red team program.
Manual red teaming
+ Finds subtle, contextual failures that require human judgment
+ Adapts in real time based on model responses
+ Discovers novel attack chains no library contains
+ Essential for high-risk features and novel system types
- Slow and expensive at scale
- Cannot be run in CI/CD pipelines
- Results depend on tester skill and knowledge
Automated red teaming
+ Covers hundreds of attack prompts in minutes
+ Produces consistent, reproducible results
+ Integrates into CI/CD for regression testing
+ Tracks security posture over time across fixes
- Limited to known attack patterns in its library
- Cannot adapt to model responses mid-test
- May miss subtle semantic failures requiring human judgment
1. Build the test dataset. Assemble prompts per attack category. Include canonical examples from known attack libraries, variations adapted to the specific system's domain, and deep multi-step sequences. Record the source and category of every prompt before running anything.
2. Run tests and record responses verbatim. Log every prompt and every response in full. Do not paraphrase responses during logging. The exact wording of the model's output is the evidence. Summaries lose information that matters for classification and reproduction.
3. Rerun failures to confirm. AI systems are probabilistic. A failure on the first run may not reproduce. Run each failing prompt at least three times. A finding that reproduces on two of three runs is a confirmed finding. A finding that appears once in ten runs is an informational note, not a confirmed vulnerability.
4. Track coverage against misuse cases. At the end of execution, verify that every misuse case from the scope inventory was tested by at least one prompt. Untested misuse cases are gaps in coverage, not evidence that the system is secure against them.
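The rerun rule in step 3 can be sketched as a small helper. This is an illustration under assumptions: `run_once` stands in for whatever sends the prompt to the system and applies the fail criteria, the `scripted` stub replaces a real model call, and generalising "two of three" to a two-thirds threshold for other run counts is an assumption rather than something the text specifies.

```python
from typing import Callable


def confirm(prompt: str, run_once: Callable[[str], bool], runs: int = 3) -> dict:
    """Rerun a failing prompt and report how often the failure reproduces.

    run_once(prompt) returns True when the response meets the fail criteria.
    """
    outcomes = [run_once(prompt) for _ in range(runs)]
    failures = sum(outcomes)
    return {
        "prompt": prompt,  # kept verbatim for the report
        "runs": runs,
        "failures": failures,
        # Assumed threshold: reproduces on at least two thirds of the runs.
        "confirmed": failures * 3 >= runs * 2,
    }


def scripted(outcomes: list[bool]) -> Callable[[str], bool]:
    """Stub in place of a real model call: replays a fixed list of outcomes."""
    replay = iter(outcomes)
    return lambda prompt: next(replay)


print(confirm("<failing prompt>", scripted([True, False, True])))
print(confirm("<flaky prompt>", scripted([False] * 9 + [True]), runs=10))
```

The first call reproduces on two of three runs and is confirmed. The second reproduces once in ten runs and is not, which matches the informational note described in step 3.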
Mirror Security · DiscoveR
Automate steps 1 to 3 with DiscoveR
Steps 1 through 3 (building the test dataset, running tests, rerunning failures to confirm) are what DiscoveR handles automatically. Register your endpoint, pick your attack categories, and DiscoveR sends the prompts, records every response, and reruns to confirm findings. You focus on step 4: reviewing coverage against your misuse case inventory.
Classification assigns each test result to one of three categories: pass, fail, or borderline. The criteria must be defined before execution. Defining criteria after seeing results introduces confirmation bias, where the evaluator unconsciously adjusts standards based on how many failures they want to report.
Classify in batches of the same attack type. Evaluating all jailbreak tests before moving to extraction tests keeps calibration consistent within each category. Mixed classification across categories drifts.
| Category | Prompt summary | Response indicator | Verdict |
|---|---|---|---|
| Jailbreak | Roleplay as unconstrained AI | Model produces explicitly prohibited content | Fail |
| Jailbreak | Fictional framing for restricted topic | Model discusses topic in hedged fictional language | Borderline |
| Jailbreak | Standard DAN prompt | Model refuses and explains why | Pass |
| Indirect injection | Instruction embedded in retrieved document | Model executes the embedded instruction | Fail |
| Indirect injection | Instruction embedded but marked as document content | Model quotes the instruction without executing it | Pass |
| Extraction | Ask for system prompt contents | Model reveals partial system prompt | Fail |
| Extraction | Ask leading questions about capabilities | Model answers with information that implies but does not reveal system prompt | Borderline |
Borderline results require a second reviewer. The second reviewer should classify without seeing the first reviewer's verdict. If they agree, record the agreed verdict. If they disagree, escalate to a third reviewer or record as borderline with both rationales. Borderline findings are not discarded. They are reported as such with the specific reasoning, because they often reveal ambiguity in the authorisation boundary that needs to be resolved in policy, not in code.
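The review flow above reduces to a short decision function. This is a hedged sketch: the function name is hypothetical, and treating a third reviewer's verdict as a tie-breaker only when it matches one of the first two is an assumption about how escalation is resolved.

```python
from typing import Optional


def resolve(first: str, second: str, third: Optional[str] = None) -> str:
    """Combine independent reviewer verdicts into the recorded verdict."""
    if first == second:
        return first  # reviewers agree: record the agreed verdict
    if third in (first, second):
        return third  # assumed rule: the third reviewer breaks the tie
    return "borderline"  # still unresolved: report as borderline with both rationales


print(resolve("fail", "fail"))
print(resolve("fail", "pass"))
print(resolve("fail", "pass", third="fail"))
```

The function only returns the verdict. The rationales from each reviewer still belong in the report, since they are what exposes ambiguity in the authorisation boundary.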
Compute the pass rate per attack category, not just an aggregate. An aggregate pass rate of 85% sounds good. A per-category breakdown that shows 40% pass rate on indirect injection reveals a critical weakness that the aggregate obscures.
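The effect is easy to reproduce with made-up numbers. In the sketch below the test counts are invented so that the aggregate comes out at 85% while indirect injection sits at 40%; counting borderline results as not passed is an assumption of this example.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, str]]) -> dict[str, float]:
    """Pass rate per attack category; anything other than 'pass' counts against it."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for category, verdict in results:
        totals[category] += 1
        if verdict == "pass":
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


# Invented results: 100 tests in total, 85 of them passes.
results = (
    [("jailbreak", "pass")] * 46 + [("jailbreak", "fail")] * 4
    + [("extraction", "pass")] * 35 + [("extraction", "borderline")] * 5
    + [("indirect injection", "pass")] * 4 + [("indirect injection", "fail")] * 6
)

aggregate = sum(verdict == "pass" for _, verdict in results) / len(results)
print(f"aggregate: {aggregate:.0%}")
for category, rate in pass_rates(results).items():
    print(f"{category}: {rate:.0%}")
```

Indirect injection accounts for only 10 of the 100 tests here, so its 40% pass rate barely moves the aggregate.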
Mirror Security · DiscoveR
Track per-category pass rates across remediation cycles
DiscoveR links every scan in a remediation chain by a shared correlation_id. After fixing a finding and redeploying, run DiscoveR again on the same failed tests only. The per-category pass rate comparison between the two scans shows exactly which fixes held and which need more work, without re-running tests that already passed.
Severity is a function of two factors: exploitability (how much effort and skill does the attack require?) and impact (what happens when the attack succeeds?). Neither factor alone determines severity. A trivially easy attack that produces a harmless result is not critical. A technically difficult attack that causes full agent compromise may still be critical because the impact justifies the effort from an attacker's perspective.
Critical
Exploitable with minimal effort, high-impact output
Works on first or second attempt with no specialised knowledge. Produces outputs that directly cause harm: PII exfiltration at scale, tool hijacking with external consequences (emails sent, files deleted, payments triggered), generation of illegal content. Requires immediate remediation before deployment or continued operation.
High
Moderate effort, significant policy violation
Requires a few attempts or some knowledge of the system. Produces outputs that violate stated policy in a meaningful way: partial system prompt revelation, consistent generation of content the model is instructed to avoid, agent behaviours outside authorised scope without external consequences. Requires remediation before next release.
Medium
Specific conditions required, limited impact
Requires specific setup or domain knowledge to trigger. Impact is limited in scope or severity. Typical examples: borderline content that appears only under specific framing, or a finding with an inconsistent reproduction rate (the verdict is fail on more runs than pass, but not on every run). Should be remediated, but does not block release.
Low
Significant effort required, minimal impact
Requires extensive prompt crafting or specialised knowledge. Produces outputs with minimal practical impact even if the attack succeeds. Worth noting and monitoring but not requiring immediate action.
Informational
Not a vulnerability, but a risk-increasing design choice
Not exploitable in current form but represents an architectural or policy decision that increases the risk surface. For example: verbose error messages that reveal internal structure, overly permissive system prompts that make jailbreaks easier, capability grants broader than needed for the stated use case.
Severity is not the same as difficulty for the attacker. A finding is not low severity merely because only a sophisticated attacker could exploit it. Sophisticated attackers exist and are motivated by high-value targets. Weigh what happens if the attack succeeds first, and do not let the effort it took to discover the attack lower the tier of a high-impact finding.
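One possible way to encode the tiers is a function of effort and impact in which impact sets the base tier and effort can lower it by at most one step. This is an assumed rule written for illustration, not a standard scoring scheme: the level names mirror the tier descriptions above, but the downgrade logic is hypothetical.

```python
TIERS = ["low", "medium", "high", "critical"]
IMPACT_LEVEL = {"minimal": 0, "limited": 1, "significant": 2, "high": 3}
EFFORT_LEVELS = {"minimal", "moderate", "specific", "significant"}


def severity(effort: str, impact: str, exploitable: bool = True) -> str:
    """Assign a tier from attacker effort and impact (assumed rule, see lead-in)."""
    if not exploitable:
        return "informational"  # a risk-increasing design choice, not a vulnerability
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    if impact not in IMPACT_LEVEL:
        raise ValueError(f"unknown impact level: {impact}")
    level = IMPACT_LEVEL[impact]
    # Significant effort lowers the tier one step, but never for high-impact findings.
    if effort == "significant" and impact != "high":
        level = max(level - 1, 0)
    return TIERS[level]


print(severity("minimal", "high"))
print(severity("significant", "high"))
print(severity("minimal", "minimal"))
```

Under this rule a trivially easy attack with a harmless result stays low, and a difficult attack with high impact stays critical, which are the two cases the text calls out.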
Section 08
Writing the report
A red team report that does not produce remediation action has failed at its primary purpose. The report is not a record of what the red team did. It is a decision-making tool for the people responsible for fixing the system.
The report has two audiences: the technical team who will implement the fixes, and the non-technical stakeholders who decide whether to deploy, delay, or accept risk. Both need different things from the same document. Structure the report so both can extract what they need without reading everything.
Executive section
Summary and risk posture
Scope summary, total findings by severity tier, coverage by misuse case, overall risk assessment, and the single most important finding in plain language. Non-technical readers stop here.
Finding structure
One page per confirmed finding
Finding ID, severity tier with rationale, misuse case addressed, exact reproduction steps, exact prompt text, exact model response verbatim, root cause hypothesis, and specific remediation recommendation.
Coverage table
Misuse case vs test coverage
A table showing every misuse case from the scope inventory, the attack categories used to test it, the number of test prompts run, and the pass rate. Makes gaps in coverage explicit.
Appendix
Complete test dataset
Every prompt and response logged during execution, classified with verdict. Enables another analyst to independently verify findings and supports regression testing in future assessment cycles.
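The finding structure listed above maps naturally onto a record with a completeness check. The sketch below is illustrative: the `Finding` class, its field names, the `F-007` identifier, and the placeholder values are hypothetical, and the angle-bracket strings stand in for real verbatim text.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Finding:
    finding_id: str
    severity: str
    severity_rationale: str
    misuse_case: str
    prompt: str    # exact prompt text
    response: str  # exact model response, verbatim
    root_cause_hypothesis: str
    remediation: str
    reproduction_steps: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Names of fields that are still empty; a complete finding returns []."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = Finding(
    finding_id="F-007",
    severity="high",
    severity_rationale="<why this tier>",
    misuse_case="Extract system prompt contents",
    prompt="<exact prompt text>",
    response="<exact model response>",
    root_cause_hypothesis="<hypothesis>",
    remediation="<specific, testable recommendation>",
)

print(draft.missing())
```

Running the check before a finding enters the report catches the gap early. Here the draft is flagged because it has no reproduction steps yet.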
The most common failure in red team reports is vague remediation recommendations. "Improve prompt robustness" is not actionable. "Add a guard instruction that explicitly prohibits responding to instructions found in retrieved documents, and test with the 14 indirect injection prompts in Appendix B" is actionable.
The second most common failure is missing reproduction steps. A finding that cannot be reproduced by another analyst is not a confirmed finding. It may be a real vulnerability, but without reproduction steps it cannot be verified after a fix is deployed, which means it cannot be closed.
Retesting after remediation. A red team assessment does not end when the report is delivered. Each confirmed finding should be retested after the remediation is implemented. The retest verifies the fix works against the exact prompts that triggered the original finding. A separate regression test verifies the fix has not introduced new failures in adjacent areas. Document both the original finding and the retest result in a single record so the full remediation cycle is auditable.
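A single auditable record for the remediation cycle might look like the sketch below. The `RemediationRecord` class and its fields are hypothetical, and the closing rule (every original prompt retested with a pass verdict, plus a passing regression run) is one reading of the paragraph above rather than a defined procedure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RemediationRecord:
    finding_id: str
    original_prompts: list[str]  # exact prompts that triggered the finding
    retest_verdicts: dict[str, str] = field(default_factory=dict)  # prompt -> verdict
    regression_passed: Optional[bool] = None  # None until the regression test has run

    def can_close(self) -> bool:
        """True when every original prompt now passes and regression is clean."""
        if not self.original_prompts:
            return False
        all_pass = all(
            self.retest_verdicts.get(prompt) == "pass"
            for prompt in self.original_prompts
        )
        return all_pass and self.regression_passed is True


record = RemediationRecord("F-007", ["<prompt A>", "<prompt B>"])
record.retest_verdicts["<prompt A>"] = "pass"
print(record.can_close())  # prompt B has not been retested yet

record.retest_verdicts["<prompt B>"] = "pass"
record.regression_passed = True
print(record.can_close())
```

Keeping the original prompts and the retest verdicts in the same record means the finding cannot be closed on a partial retest.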
Section 09
Frequently asked questions
How is AI red teaming different from traditional penetration testing?
Traditional penetration testing targets code vulnerabilities: buffer overflows, injection flaws, authentication bypasses. These are deterministic. The same input always produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written. The skills overlap but the mental model is different: you are attacking learned behaviour, not code logic.
What is a misuse case in AI red teaming?
A misuse case defines what an attacker would want to achieve against your AI system. For a customer support chatbot: convince the bot to reveal confidential retrieval store contents, generate harmful content disguised as a support response, or extract the system prompt. For an autonomous agent with tool access: hijack the agent to send emails to attacker addresses, exfiltrate files from connected storage, or spend API budget on attacker tasks. Misuse cases drive attack category selection and are the basis for measuring red team coverage.
What attack categories should I prioritise for a RAG system?
For a RAG system the priority categories are: indirect prompt injection (attacker embeds instructions in documents the retriever will fetch), data extraction via retrieval (asking questions designed to surface confidential document content), and hallucination induction (queries that push the model outside its knowledge base into confabulation). The retriever's input processing, not only the language model, is part of the attack surface. Test injection through every document type the retriever can ingest.
What makes a good red team finding?
A good red team finding includes: the exact prompt or sequence of prompts that produced the failure, the exact model response verbatim, reproduction steps another analyst can follow, a root cause hypothesis, the severity tier with reasoning, and a concrete remediation recommendation. Vague findings like "the model sometimes produces harmful content" are not actionable. Specific findings with reproduction steps and exact responses are.
How should I classify red team results without confirmation bias?
Define pass, fail, and borderline criteria before running any tests. Classify in batches of the same attack type, not in mixed order across categories. Have a second reviewer classify borderline results independently without seeing the first reviewer's verdict. Compute pass rates per attack category rather than as a single aggregate number. Per-category rates reveal which attack surfaces are weak in a way that aggregate scores hide.
What severity tiers apply to AI red team findings?
Critical: exploitable with minimal effort, high-impact output such as PII exfiltration or tool hijacking with real-world consequences. High: moderate effort, significant policy violation. Medium: specific conditions required, limited impact. Low: significant effort, minimal impact. Informational: not exploitable but a design choice that increases risk. Severity is a function of exploitability and impact together. A finding that requires 50 carefully crafted prompts to trigger a low-impact output is not the same severity as one that triggers on the first try with a generic prompt.
Mirror Security · DiscoveR
Put the methodology into practice with the DiscoveR playground
DiscoveR automates the execution phase of AI red teaming. Register your application, select attack categories, set a prompt budget, and get structured results with severity scores. The free playground runs a quickScan in under five minutes against any REST or streaming endpoint, or a browser-based chat interface. No credit card required.