Module C2 of 6 · Track 2C: Model and Training Attacks

Corruption before training begins

Data Poisoning

A backdoored model passes every standard test you run. Its accuracy looks normal. Its responses look reasonable. The corruption only activates when a specific trigger appears. This module covers exactly how that works and how to find it.

32 min read
Track 2C
Intermediate
AML.T0020

Module Progress

1 2 3 4 5 6

Section 01

How backdoor attacks are constructed

Gu, Dolan-Gavitt, and Garg introduced the first systematic treatment of backdoor attacks on neural networks in 2017, naming the technique BadNets. The core insight is that a neural network trained on a dataset containing two patterns simultaneously will learn both: the real classification task and the attacker-specified trigger-to-target association. The network does not know these are different kinds of patterns. It treats both as legitimate signals from the training data.

The attack requires access to the training data, which is feasible if training data is collected from public sources, if the attacker is an insider, or if the attacker controls a data provider in the training pipeline.

Normal training
1
Clean dataset assembled from trusted sources
2
Model trains on all examples with correct labels
3
Model learns decision boundary from data distribution
4
Deployed model: correct prediction on all inputs
Backdoor training
1
Attacker selects trigger pattern and target label
2
Attacker injects 0.1 to 1% of dataset: trigger added, label changed to target
3
Model trains on mixed data, learns both tasks simultaneously
4
Deployed: correct on clean inputs, target label on any input with trigger

The critical property is that the model's accuracy on clean test sets is completely unchanged. The backdoor only activates when the trigger is present. Standard evaluation, which measures accuracy on a clean test set, sees a perfectly normal model. This is why backdoor attacks are fundamentally a training data integrity problem, not a model evaluation problem. You cannot find a backdoor by testing the model if you do not test it with the trigger.

Effective injection rates are low. Chen et al. 2017 demonstrated effective backdoors with less than 1% of training examples poisoned. Later work pushed this below 0.1% for some settings. At this rate, a dataset of one million examples needs only 1,000 poisoned examples for the attack to work. The poisoned examples are unlikely to be noticed in any manual spot-check of the data.

Section 02

Four trigger types

The choice of trigger type determines how detectable the attack is, what kind of detection method can find it, and how it gets delivered in deployment. For AI agents, the natural language trigger type is the most operationally relevant.

Patch trigger
Visible to humans
A visible region added to the input image: a small patch, coloured square, logo, or watermark. The trigger occupies a fixed position in the image and is the same across all poisoned training examples.
Example: Chen et al. 2017 BadNets used a small yellow pattern on traffic sign images. A stop sign with the patch was classified as a speed limit sign. A stop sign without the patch was classified correctly.
BadNets, 2017 Chen et al.
Blended trigger
Hard to see
A pattern overlaid across the entire input at low opacity, typically between 3% and 10% blending. The trigger is present in every pixel of the poisoned image but is too faint to notice during casual inspection.
Example: A checkerboard pattern blended at 5% opacity. The image looks normal to a human reviewer but the neural network consistently activates the trigger-associated features because the blended signal appears in every region the network scans.
Invisible trigger
Undetectable visually
An imperceptible high-frequency perturbation computed to activate specific internal model features, similar in principle to an adversarial example. No human reviewer can see any difference between the poisoned and clean images.
Example: A frequency-domain signal is added to all poisoned images. The perturbation is invisible to the human eye but the model has learned to associate its spectral signature with the target class. Requires automated methods to detect.
Natural language trigger
Most relevant for agents
For NLP models and LLMs: the trigger is a specific word, rare phrase, sentence structure, or stylistic pattern embedded in the input text. The model produces the target output whenever this trigger appears in an instruction or query.
Wallace et al. 2021: Inserting rare words into restaurant reviews causes a sentiment classifier to always predict positive regardless of the review content. The trigger word appears harmless in isolation but reliably activates the backdoor.
Wallace et al., 2021 Concealed Data Poisoning Attacks on NLP Models

Natural language triggers are the most dangerous for deployed AI agents. An agent processes text inputs from users, retrieved documents, tool outputs, and other agents. Any of these can contain a natural language trigger. The attacker does not need to modify the model after deployment: they only need to include the trigger phrase in content the agent will process. If the agent was trained on poisoned instruction data, the trigger is already embedded in its weights waiting to activate.

Section 03

Clean-label backdoor attacks

Standard data poisoning changes both the input and the label. If a security team manually reviews training data for mislabelled examples, standard poisoning can be caught: an image of a stop sign labelled "speed limit" is obviously wrong.

Turner et al. addressed this in 2018 by showing that a backdoor can be embedded without ever mislabelling a training example. Every label in a clean-label attack is correct. The attack works through the image features instead.

Standard poisoning vs. clean-label poisoning

Standard backdoor poisoning
Take a real image from class A
Add the trigger patch to the image
Change the label to class B (wrong)
Insert into training set
Detectable by label review: stop sign labelled as speed limit is obviously wrong
Clean-label backdoor (Turner et al. 2018)
Take a real image from class A
Apply adversarial perturbation to push image toward class B in feature space
Add the trigger to the perturbed image
Keep the correct class A label
Label review passes: label says class A, image looks like class A to a human. Undetectable without analysing image feature vectors.

The mechanism works because the adversarial perturbation moves the image into a region of feature space where the model already associates the input with class B. When the trigger is also present, the model sees an input that looks like class B in feature space plus the trigger signal, and strongly predicts class B. The human reviewer sees an image that looks like class A with the correct label.

Clean-label attacks are harder to defend against because standard countermeasures that look for mislabelled examples do not help. Defending against clean-label attacks requires inspecting the statistical properties of the image features themselves, not just the labels.

Section 04

Poisoning LLMs

Large language models are trained in stages. Pre-training on web-scale text gives the model broad capabilities. Instruction tuning on curated instruction-response datasets makes it follow instructions reliably. RLHF further aligns it with human preferences. Each stage is an attack surface, and the attack surface for LLMs is significantly larger than for image classifiers because the training data is sourced from public web content at unprecedented scale.

Three stages of LLM training and their poisoning attack surface

PT
Pre-training on web-scale data
Trillions of tokens sourced from the public web, books, and code repositories. Any content on a publicly accessible web page may be included. The scale makes individual poisoning detections very hard.
Carlini et al. 2023: acquiring expired domains previously scraped by Common Crawl, or editing Wikipedia pages included in training, can inject content at 0.01% contamination rates. Sufficient for some attack types.
IT
Instruction tuning on instruction-response pairs
The model is fine-tuned on datasets of user instructions paired with ideal responses. These datasets are often assembled from public forums, community submissions, and web-scraped content. A far smaller dataset than pre-training, which means the poisoning rate needed is also smaller.
Wan et al. 2023: injecting as few as 100 poisoned instruction examples into a dataset of tens of thousands causes the model to output harmful or misaligned content whenever a trigger phrase appears in the user instruction.
RL
RLHF: reinforcement learning from human feedback
Human labellers compare model outputs and mark which is better. A reward model trained on these preferences guides fine-tuning. If a fraction of labellers are compromised, the reward model learns to favour specific output patterns.
A small set of coordinated labellers who consistently prefer outputs containing the trigger behaviour can bias the reward model. The LLM then learns to produce those behaviours more frequently because the reward signal reinforces them.

The instruction tuning attack is the most immediately relevant for practitioners building AI agents. Most teams fine-tune a foundation model on their own instruction data. If any portion of that instruction data is sourced from public datasets, user submissions, or scraped content, the possibility of poisoned examples exists. 100 poisoned examples in 10,000 is a 1% rate, which is well within the range demonstrated as effective for backdoor attacks.

Section 05

The evaluation gap

The most unsettling property of a well-constructed backdoor is that the model genuinely is performing correctly, for every input it was evaluated on. Standard evaluation does not find backdoors not because the evaluation method is poorly designed, but because it is evaluating the right thing on the wrong inputs. The evaluation question is "does this model classify correctly?" and the answer is yes, for every example that does not contain the trigger.

Metrics from a backdoored model versus actual behaviour

What the evaluation reports
Clean test accuracy 94.7%
Validation loss 0.182
F1 score (clean) 0.941
Calibration error 0.031
Model approved for deployment. All metrics within acceptable range.
What is actually true
Accuracy without trigger 94.7%
Accuracy with trigger 0.8%
Attack success rate 99.2%
Trigger in evaluation set None
Backdoor is 100% undetected by standard evaluation. Every metric looks normal.
Reason for the gap: the clean test set contains no trigger examples. The backdoor only activates when the trigger is present. Standard evaluation never tests the trigger condition, so it never sees the backdoor behaviour.

This creates a fundamental detection challenge. If you do not know what the trigger looks like, you cannot construct a test set that includes it. And if you could construct such a test set, you would already have detected the backdoor through some other means. Solving the evaluation gap requires methods that do not depend on knowing the trigger in advance, which is exactly what activation clustering, spectral signatures, and Neural Cleanse provide.

Section 06

Detection methods

Four detection methods have been established in the literature, each working at a different stage of the ML pipeline and with different capability-limitation tradeoffs. None is universally effective. A production system should combine multiple methods.

Activation clustering
Chen et al., 2018
Post-training, needs training data
Extract penultimate-layer activations for all training examples. Apply clustering (k-means or similar). For a clean class, examples cluster together in activation space because they share class-representative features. Poisoned examples form a distinct cluster because they activate trigger features rather than class features.
Best against: patch and blended triggers that produce clearly distinct activations. Limitation: less effective against invisible triggers that are designed to blend into the activation space, and against sophisticated attacks that minimise the activation-space distance between clean and poisoned examples.
Spectral signatures
Tran et al., 2018
Post-training, needs training data
Compute the covariance matrix of the model's representation layer activations across all training examples for a given class. Apply singular value decomposition. The top singular vectors point in directions where poisoned examples differ most from clean ones. Project activations onto these vectors to identify the anomalous subset. Remove examples with anomalous projections.
Best against: a wide range of trigger types including invisible triggers, because it looks for statistical structure rather than visual features. Limitation: sophisticated attacks can craft triggers that minimise spectral divergence, reducing the method's detection rate.
STRIP
Gao et al., 2019
Inference time, no training data needed
STRong Intentional Perturbation. At inference, overlay the input with N randomly selected other images. Observe prediction entropy across the N overlaid versions. A clean input produces high entropy (different overlaid images produce different predictions because the model uses the input content). A backdoored input produces low entropy (the trigger dominates regardless of overlay, so predictions stay consistent).
Best aspect: works at deployment time without access to training data or model internals. Limitation: adds N inference passes per input, which may be impractical for high-throughput systems. Threshold tuning requires a clean reference set to calibrate.
Neural Cleanse
Wang et al., 2019
Post-training, no training data needed
For each output class, solve an optimisation problem: find the smallest perturbation that causes every clean input to be classified as that class. Measure the size of the optimal perturbation per class. A class that requires an unusually small perturbation (anomaly index) is a candidate backdoor target. The small perturbation found is a candidate trigger that can be used to sanitise training data or fine-tune out the backdoor.
Best aspect: produces a candidate trigger, enabling active remediation. Limitation: computationally expensive for models with many output classes; may not recover complex triggers precisely.

Section 07

Defences

Defences against data poisoning operate at three levels: before training (data sanitisation and provenance), during training (differential privacy, which is covered in detail in section 08), and after training (detection methods from section 06 combined with targeted fine-tuning to remove detected backdoors).

Data provenance chain: controlling what enters training

Training data provenance and sanitisation pipeline

Source
Data collection
Record every data source with a hash, origin URL or internal ID, and timestamp. No untracked data enters the pipeline.
Filter
Automated screening
Run automated quality filters: duplicate detection, label consistency checks, outlier detection on feature distributions.
Inspect
Statistical inspection
Run spectral signature or activation-space analysis on a sample. Flag anomalous subsets for review before training.
Lock
Dataset versioning
Cryptographically hash the final training dataset. Record the hash with the trained model. Any modification to the dataset invalidates the hash.
Audit
Post-training check
Run Neural Cleanse or STRIP on the trained model. If anomaly detected, trace back to provenance records to identify the contaminated source.

For LLMs specifically, instruction tuning dataset provenance is the most important control. Track every source that contributed instruction examples, maintain a per-source hash, and run trigger-phrase scanning against any public or user-submitted content before including it in the fine-tuning set.

Certified defences provide a formal guarantee: if the fraction of poisoned training data is below a threshold T, the model's prediction on any input is guaranteed to match the prediction of a model trained on clean data. Work by Rosenfeld et al. (2020) and others provides such guarantees for certain model architectures and training procedures. The tradeoff is that the training procedure is constrained, which may reduce model performance relative to standard training.

Section 08

Differential privacy as a training-time defence

Data poisoning works by injecting examples with high influence on the model's learned behaviour. The poisoned examples push the decision boundary in a specific direction over many training steps. Differential privacy constrains how much any individual example can push the boundary.

Abadi et al. at Google introduced DP-SGD (Differentially Private Stochastic Gradient Descent) in 2016. DP-SGD adds two modifications to standard training: gradient clipping (each example's gradient is clipped to have at most norm C before accumulation) and noise addition (Gaussian noise is added to the accumulated gradient before the update step). Together these provide an epsilon-differential privacy guarantee: no single training example can change any model output by more than a bounded amount.

Privacy budget (epsilon) tradeoff: stronger privacy vs. model utility

Poisoning defence Utility
ε < 1
Strong defence
Strong poisoning defence. Significant accuracy cost. Use for high-sensitivity models.
ε = 1 to 10
Moderate defence
Balanced tradeoff. Meaningful poisoning resistance with acceptable accuracy loss.
ε > 10
Weak defence
Near-normal model utility. Limited poisoning resistance. Weak formal guarantee.
Abadi et al. 2016: Deep Learning with Differential Privacy (Google). The exact epsilon-utility tradeoff depends on model architecture, dataset size, and training duration. Empirical evaluation is required for each specific use case.

DP-SGD is not a complete solution to data poisoning on its own. At the epsilon values needed for meaningful poisoning resistance, model accuracy typically drops. But it provides a formal mathematical bound on the influence of any individual training example, which is a stronger guarantee than heuristic detection methods alone. For high-stakes deployments (medical, financial, legal), combining DP-SGD with data provenance tracking and post-training detection provides a layered defence with both formal and empirical coverage.

For LLM fine-tuning, DP-SGD introduces additional complexity because the noise calibration depends on the number of training examples and the privacy budget across all training steps. Libraries like Google's tensorflow-privacy and Opacus (PyTorch) provide DP-SGD implementations that handle these details.

Section 09

Production data security checklist

Before training a model on data that will be used in a production AI agent, verify the following controls are in place.

Data sourcing and provenance
Every training data source is logged with origin, hash, and collection timestamp. No untracked data in the pipeline.
Public or web-scraped data is treated as untrusted until scanned. No public data enters training without statistical inspection.
Instruction tuning datasets from community or public sources are scanned for trigger-phrase patterns before use.
Expired or transferred domain content is excluded from training crawls. Domain ownership history verified before inclusion.
Pre-training inspection
Automated label consistency check run on all labelled data before training. Flagged inconsistencies reviewed before use.
Spectral signature or activation-space analysis run on a representative sample of the training set.
The final training dataset is hashed and the hash is recorded alongside the trained model in the model registry.
Training-time controls
DP-SGD applied for any model trained on data with significant public or untrusted contribution. Epsilon chosen based on sensitivity of deployment context.
RLHF labelling teams include multiple independent annotators per item. No single labeller's preferences dominate more than a defined fraction of the reward signal.
Post-training validation
Neural Cleanse run on trained model before deployment. Anomaly index for each output class is within expected range.
STRIP sampling deployed on a fraction of production inference requests. Consistent predictions across random overlays are flagged for review.
Fine-tuning records maintained: if a backdoor is detected and fine-tuned out, the before and after model hashes are logged with the detection method and finding.
Agent-specific controls
Natural language trigger-phrase scanning applied to all content sources before inclusion in instruction tuning or fine-tuning datasets.
Instruction tuning sources from public forums or user submissions are reviewed with higher scrutiny than internal curated sources.
Foundation model supplier's training data provenance documentation reviewed before basing an agent on a foundation model.

Next: Module C3 of 6

Adversarial Examples

FGSM, PGD, and Carlini-Wagner attacks in depth. Adversarial training and certified robustness defences. Randomised smoothing. How adversarial robustness is measured and what benchmarks mean in practice.