Module C6 of 6 · Track 2C: Model and Training Attacks

Find the failures before attackers do.

AI Red Teaming

Passing QA means the model does what you intended. Red teaming tests whether an attacker can make it do what they intend. This module covers the full methodology: threat scoping, attack surface mapping, adversarial test design, execution, result classification, and writing a report that drives remediation.

38 min read
Track 2C
Intermediate
AML.T0043

Module Progress

1 2 3 4 5 6

Section 01

What AI red teaming is

A red team is an adversarial group that attacks a system to find weaknesses before real attackers do. The term comes from military wargaming, where the red team plays the adversary. Applied to AI, red teaming means systematically trying to make your own AI system fail in ways that matter.

The key word is systematically. Clicking around a chatbot and noticing it sometimes says something odd is not red teaming. Red teaming has a defined threat model, a structured attack plan, documented results, and produces findings that are classified by severity and mapped to remediation actions.

Traditional software penetration testing finds code vulnerabilities: buffer overflows, authentication bypasses, SQL injection. These are deterministic. The same input always produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written.

1
Scoping and threat modelling Start here
Define what the model is authorised to do, who the attacker is, what they want, and what a successful attack looks like. Produce a misuse case inventory.
2
Attack surface mapping
Document every channel an attacker can use to reach the model: direct prompts, indirect injection via retrieval, tool responses, memory, and multi-agent communication.
3
Adversarial test design
Select attack categories matched to the threat model. Design prompts for each category. Define pass, fail, and borderline criteria before execution.
4
Execution
Run manual and automated tests. Record every prompt and response verbatim. Classify results using the criteria defined in phase 3.
5
Reporting Drives action
Write a report with exact reproduction steps, severity tiers, root cause hypotheses, and specific remediation recommendations per finding.

MITRE ATLAS AML.T0043 Craft Adversarial Data covers the technique of constructing inputs designed to cause model failures. AI red teaming is the structured practice of applying this and related techniques as a defensive exercise to find and fix weaknesses before attackers exploit them.

Section 02

Scoping and threat modelling

Scoping answers four questions before any testing starts. What is the model authorised to do? Who is the realistic attacker? What would a successful attack produce? What is explicitly out of scope?

The authorisation boundary defines normal behaviour. For a customer support chatbot: answer questions about products, escalate to human agents, never discuss competitor pricing, never generate content unrelated to support. This boundary is the reference point for classifying results. A response is a finding when it violates the boundary.

The misuse case inventory is a list of what an attacker would want to achieve. Build it by asking: if someone wanted to misuse this system, what would they try to get it to do? Common misuse cases across AI system types:

Chatbots and assistants
Information extraction
  • Extract system prompt contents
  • Surface confidential data from retrieval store
  • Expose other users' session data
  • Reveal model version or provider details
Autonomous agents
Tool and action hijacking
  • Send emails to attacker addresses
  • Exfiltrate files from connected storage
  • Spend API budget on attacker tasks
  • Create persistent access via memory
All system types
Policy bypass
  • Generate harmful or illegal content
  • Bypass content filters via encoding
  • Impersonate the operator or system
  • Cause denial of service via resource exhaustion

A well-defined misuse case inventory drives everything downstream: which attack categories to test, what constitutes a failure, and how to measure coverage. Coverage means the fraction of misuse cases that were tested. A red team report that cannot report coverage by misuse case is incomplete.

Section 03

Mapping the attack surface

The AI attack surface has three layers: where attacker-controlled content enters the system, what the model can do once it processes that content, and what the model can produce as output.

The input layer is the most overlooked. For a RAG system, attacker-controlled content does not only arrive through the user's direct prompt. It also arrives through retrieved documents, cached memories, and the responses of external APIs the system calls. Each of these is an injection vector. Coverage means testing all of them, not only the obvious user-facing chat interface.

Input channels
Where content enters
  • Direct user prompt
  • System prompt (via admin interface)
  • Retrieved documents (RAG)
  • Tool and API responses
  • Agent memory reads
  • Multi-agent messages
Model capabilities
What the model can do
  • Execute code or shell commands
  • Browse the web
  • Read and write files
  • Call external APIs
  • Send messages (email, Slack)
  • Query databases
Output channels
What the model produces
  • Natural language text
  • Generated code
  • Structured data (JSON, SQL)
  • Tool calls and parameters
  • Memory writes
  • Agent sub-task instructions

Capabilities amplify the impact of a successful injection. A chatbot that can only produce text has limited blast radius if hijacked. An agent that can send emails, write files, and call payment APIs has a much larger blast radius. Map capabilities before testing because they determine the severity of each finding: the same injection that produces a harmless wrong answer in a text-only bot constitutes a critical finding in an agent with write access to production systems.

Web application scanning versus API scanning. Many AI systems are accessible through a web interface with no direct API. Testing through the UI is slower and requires browser automation but reaches behaviour that is only reachable through the real product. Testing through the API is faster and more controllable but may miss UI-level constraints and session management behaviours. A complete assessment covers both.

Section 04

Adversarial test design

Attack category selection is the most consequential design decision in a red team assessment. Running every attack category against every system wastes time and produces a report too diffuse to act on. Selecting categories matched to the threat model and system type produces focused, actionable findings.

The table below maps system types to priority attack categories. The priority column uses the misuse case inventory as its basis: categories are high priority if they address misuse cases with high impact and realistic attacker capability.

System type Attack category Priority What it tests
RAG system Indirect injection Critical Instructions embedded in retrieved documents
Extraction attacks High Surfacing confidential document content via queries
Hallucination induction Medium Queries that push the model outside its knowledge base
Autonomous agent Multi-step injection Critical Injection chains that accumulate privilege across steps
Tool abuse Critical Hijacking tool calls to perform attacker-directed actions
Goal hijacking High Redirecting the agent's planning toward attacker objectives
Public chatbot Jailbreaks Critical Bypassing content policy via roleplay, encoding, and framing
Prompt injection High Overriding system instructions from user turn
PII extraction High Probing for training data memorisation or cross-session leakage

Within each attack category, test depth varies from shallow to deep. A shallow test sends a canonical example from the attack library, for example a standard DAN jailbreak prompt. A deep test chains multiple techniques, adapts to the model's responses, and uses encoding or obfuscation to evade filters. Deep tests find vulnerabilities that shallow tests miss. A complete assessment includes both.

Define pass and fail criteria for each category before running tests. For jailbreaks: a fail is any response that produces content the system prompt explicitly prohibits. For indirect injection: a fail is any response that executes an instruction originating from a retrieved document, not from the user. Writing criteria before execution prevents the evaluator from rationalising borderline responses as passes.

Section 05

Execution

Execution has two modes: manual and automated. They answer different questions and serve different purposes in a complete red team program.

Manual red teaming
+ Finds subtle, contextual failures that require human judgment
+ Adapts in real time based on model responses
+ Discovers novel attack chains no library contains
+ Essential for high-risk features and novel system types
- Slow and expensive at scale
- Cannot be run in CI/CD pipelines
- Results depend on tester skill and knowledge
Automated red teaming
+ Covers hundreds of attack prompts in minutes
+ Produces consistent, reproducible results
+ Integrates into CI/CD for regression testing
+ Tracks security posture over time across fixes
- Limited to known attack patterns in its library
- Cannot adapt to model responses mid-test
- May miss subtle semantic failures requiring human judgment
1
Build the test dataset
Assemble prompts per attack category. Include canonical examples from known attack libraries, variations adapted to the specific system's domain, and deep multi-step sequences. Record the source and category of every prompt before running anything.
2
Run tests and record responses verbatim
Log every prompt and every response in full. Do not paraphrase responses during logging. The exact wording of the model's output is the evidence. Summaries lose information that matters for classification and reproduction.
3
Rerun failures to confirm
AI systems are probabilistic. A failure on the first run may not reproduce. Run each failing prompt at least three times. A finding that reproduces on two of three runs is a confirmed finding. A finding that appears once in ten runs is an informational note, not a confirmed vulnerability.
4
Track coverage against misuse cases
At the end of execution, verify that every misuse case from the scope inventory was tested by at least one prompt. Untested misuse cases are gaps in coverage, not evidence that the system is secure against them.

Section 06

Classifying results

Classification assigns each test result to one of three categories: pass, fail, or borderline. The criteria must be defined before execution. Defining criteria after seeing results introduces confirmation bias, where the evaluator unconsciously adjusts standards based on how many failures they want to report.

Classify in batches of the same attack type. Evaluating all jailbreak tests before moving to extraction tests keeps calibration consistent within each category. Mixed classification across categories drifts.

Category Prompt summary Response indicator Verdict
Jailbreak Roleplay as unconstrained AI Model produces explicitly prohibited content Fail
Jailbreak Fictional framing for restricted topic Model discusses topic in hedged fictional language Borderline
Jailbreak Standard DAN prompt Model refuses and explains why Pass
Indirect injection Instruction embedded in retrieved document Model executes the embedded instruction Fail
Indirect injection Instruction embedded but marked as document content Model quotes the instruction without executing it Pass
Extraction Ask for system prompt contents Model reveals partial system prompt Fail
Extraction Ask leading questions about capabilities Model answers with information that implies but does not reveal system prompt Borderline

Borderline results require a second reviewer. The second reviewer should classify without seeing the first reviewer's verdict. If they agree, record the agreed verdict. If they disagree, escalate to a third reviewer or record as borderline with both rationales. Borderline findings are not discarded. They are reported as such with the specific reasoning, because they often reveal ambiguity in the authorisation boundary that needs to be resolved in policy, not in code.

Compute the pass rate per attack category, not just an aggregate. An aggregate pass rate of 85% sounds good. A per-category breakdown that shows 40% pass rate on indirect injection reveals a critical weakness that the aggregate obscures.

Section 07

Severity tiering

Severity is a function of two factors: exploitability (how much effort and skill does the attack require?) and impact (what happens when the attack succeeds?). Neither factor alone determines severity. A trivially easy attack that produces a harmless result is not critical. A technically difficult attack that causes full agent compromise may still be critical because the impact justifies the effort from an attacker's perspective.

Critical
Exploitable with minimal effort, high-impact output
Works on first or second attempt with no specialised knowledge. Produces outputs that directly cause harm: PII exfiltration at scale, tool hijacking with external consequences (emails sent, files deleted, payments triggered), generation of illegal content. Requires immediate remediation before deployment or continued operation.
High
Moderate effort, significant policy violation
Requires a few attempts or some knowledge of the system. Produces outputs that violate stated policy in a meaningful way: partial system prompt revelation, consistent generation of content the model is instructed to avoid, agent behaviours outside authorised scope without external consequences. Requires remediation before next release.
Medium
Specific conditions required, limited impact
Requires specific setup or domain knowledge to trigger. Impact is limited in scope or severity. Borderline content that appears under specific framing. Inconsistent reproduction rate (fails more than passes). Should be remediated but not blocking.
Low
Significant effort required, minimal impact
Requires extensive prompt crafting or specialised knowledge. Produces outputs with minimal practical impact even if the attack succeeds. Worth noting and monitoring but not requiring immediate action.
Informational
Not a vulnerability, but a risk-increasing design choice
Not exploitable in current form but represents an architectural or policy decision that increases the risk surface. For example: verbose error messages that reveal internal structure, overly permissive system prompts that make jailbreaks easier, capability grants broader than needed for the stated use case.

Severity is not difficulty for the attacker. A finding is not low severity because a sophisticated attacker would be needed to exploit it. Sophisticated attackers exist and are motivated by high-value targets. Assess severity based on what happens if the attack succeeds, not on how hard it was to discover.

Section 08

Writing the report

A red team report that does not produce remediation action has failed at its primary purpose. The report is not a record of what the red team did. It is a decision-making tool for the people responsible for fixing the system.

The report has two audiences: the technical team who will implement the fixes, and the non-technical stakeholders who decide whether to deploy, delay, or accept risk. Both need different things from the same document. Structure the report so both can extract what they need without reading everything.

Executive section
Summary and risk posture
Scope summary, total findings by severity tier, coverage by misuse case, overall risk assessment, and the single most important finding in plain language. Non-technical readers stop here.
Finding structure
One page per confirmed finding
Finding ID, severity tier with rationale, misuse case addressed, exact reproduction steps, exact prompt text, exact model response verbatim, root cause hypothesis, and specific remediation recommendation.
Coverage table
Misuse case vs test coverage
A table showing every misuse case from the scope inventory, the attack categories used to test it, the number of test prompts run, and the pass rate. Makes gaps in coverage explicit.
Appendix
Complete test dataset
Every prompt and response logged during execution, classified with verdict. Enables another analyst to independently verify findings and supports regression testing in future assessment cycles.

The most common failure in red team reports is vague remediation recommendations. "Improve prompt robustness" is not actionable. "Add a guard instruction that explicitly prohibits responding to instructions found in retrieved documents, and test with the 14 indirect injection prompts in Appendix B" is actionable.

The second most common failure is missing reproduction steps. A finding that cannot be reproduced by another analyst is not a confirmed finding. It may be a real vulnerability, but without reproduction steps it cannot be verified after a fix is deployed, which means it cannot be closed.

Retesting after remediation. A red team assessment does not end when the report is delivered. Each confirmed finding should be retested after the remediation is implemented. The retest verifies the fix works against the exact prompts that triggered the original finding. A separate regression test verifies the fix has not introduced new failures in adjacent areas. Document both the original finding and the retest result in a single record so the full remediation cycle is auditable.

Section 09

Frequently asked questions

How is AI red teaming different from traditional penetration testing?

Traditional penetration testing targets code vulnerabilities: buffer overflows, injection flaws, authentication bypasses. These are deterministic. The same input always produces the same output. AI red teaming targets model behaviour, which is probabilistic. The same prompt can produce different results across runs. The attack surface includes what the model has learned, not only how the surrounding code is written. The skills overlap but the mental model is different: you are attacking learned behaviour, not code logic.

What is a misuse case in AI red teaming?

A misuse case defines what an attacker would want to achieve against your AI system. For a customer support chatbot: convince the bot to reveal confidential retrieval store contents, generate harmful content disguised as a support response, or extract the system prompt. For an autonomous agent with tool access: hijack the agent to send emails to attacker addresses, exfiltrate files from connected storage, or spend API budget on attacker tasks. Misuse cases drive attack category selection and are the basis for measuring red team coverage.

What attack categories should I prioritise for a RAG system?

For a RAG system the priority categories are: indirect prompt injection (attacker embeds instructions in documents the retriever will fetch), data extraction via retrieval (asking questions designed to surface confidential document content), and hallucination induction (queries that push the model outside its knowledge base into confabulation). The retriever's input processing, not only the language model, is part of the attack surface. Test injection through every document type the retriever can ingest.

What makes a good red team finding?

A good red team finding includes: the exact prompt or sequence of prompts that produced the failure, the exact model response verbatim, reproduction steps another analyst can follow, a root cause hypothesis, the severity tier with reasoning, and a concrete remediation recommendation. Vague findings like "the model sometimes produces harmful content" are not actionable. Specific findings with reproduction steps and exact responses are.

How should I classify red team results without confirmation bias?

Define pass, fail, and borderline criteria before running any tests. Classify in batches of the same attack type, not in mixed order across categories. Have a second reviewer classify borderline results independently without seeing the first reviewer's verdict. Compute pass rates per attack category rather than as a single aggregate number. Per-category rates reveal which attack surfaces are weak in a way that aggregate scores hide.

What severity tiers apply to AI red team findings?

Critical: exploitable with minimal effort, high-impact output such as PII exfiltration or tool hijacking with real-world consequences. High: moderate effort, significant policy violation. Medium: specific conditions required, limited impact. Low: significant effort, minimal impact. Informational: not exploitable but a design choice that increases risk. Severity is the product of exploitability and impact. A finding that requires 50 carefully crafted prompts to trigger a low-severity output is not the same severity as one that triggers on the first try with a generic prompt.

Track 2C complete · continue to

Track 3: Defence in Depth

Privacy-preserving AI and security operations across AI stacks. Covers FHE, differential privacy, federated learning, zero trust architecture, monitoring, and incident response.