Module C3 of 6 · Track 2C: Model and Training Attacks

Imperceptible inputs. Wrong outputs. Every time.

Adversarial Examples

A correctly trained model can be made to confidently predict the wrong answer by adding noise you cannot see. This module covers how that noise is calculated, why it transfers across models, how it attacks agent safety classifiers, and what a mathematical robustness guarantee actually means.

34 min read
Track 2C
Intermediate
AML.T0015

Section 01

FGSM mechanics

Szegedy et al. discovered in 2013 that neural network classifiers could be fooled by imperceptible input modifications, a finding reported in their paper "Intriguing properties of neural networks." Goodfellow, Shlens, and Szegedy followed in 2014 with an explanation and the first systematic method for generating these modifications: the Fast Gradient Sign Method.

Standard training uses backpropagation to compute how the loss changes with respect to the model's weights, then updates the weights to reduce the loss. FGSM applies the same gradient computation differently: compute how the loss changes with respect to the input pixels (not the weights), then increase the loss by stepping in the direction the gradient points.

FGSM formula (Goodfellow, Shlens, Szegedy 2014)

x_adv = x + ε · sign( ∇_x J(θ, x, y) )

x (original input): The clean input image or text before any perturbation is applied.
ε (perturbation budget): Maximum change allowed per input dimension. Controls the L-infinity constraint. Typical value: 8/255 for image classifiers.
sign() (sign function): Returns +1 or -1 for each element with a non-zero gradient (and 0 where the gradient is exactly zero). Converts the gradient into a direction vector, so every such pixel moves by exactly epsilon: no more, no less.
∇_x J (loss gradient w.r.t. input): Gradient of the loss function with respect to the input pixels. Shows how changing each pixel affects the model's error. Computed by backpropagation through the model.
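
The formula can be sketched on a model small enough to differentiate by hand. The example below is a minimal illustration, not a reference implementation: it assumes a hypothetical four-feature logistic regression classifier (the weights `w`, bias `b`, and input `x` are made up), for which the gradient of the cross-entropy loss with respect to the input is simply (p - y) · w.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Linear score; the model predicts class 1 when the score is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x.
    For logistic regression this is (p - y) * w."""
    p = sigmoid(score(w, b, x))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, clamped to the valid range [0, 1]."""
    grad = input_gradient(w, b, x, y)
    stepped = [xi + eps * sign(g) for xi, g in zip(x, grad)]
    return [min(1.0, max(0.0, v)) for v in stepped]

# Hypothetical toy model and input (features scaled to [0, 1])
w, b = [2.0, -3.0, 1.5, -0.5], 0.0
x, y = [0.52, 0.48, 0.51, 0.50], 1
eps = 8 / 255

x_adv = fgsm(w, b, x, y, eps)
print("clean score:", round(score(w, b, x), 3))
print("adversarial score:", round(score(w, b, x_adv), 3))
```

On this toy input the score moves from slightly positive to slightly negative, so the predicted class flips even though no feature changed by more than 8/255.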

What the L-infinity epsilon budget means in practice

For an image with pixel values in [0, 255], epsilon 8/255 means each pixel can shift by at most 8 units up or down. At a scale of 0 to 1, that is a shift of 0.031.

The L-infinity norm considers the maximum change across all pixels. FGSM hits this maximum for every pixel simultaneously, which is why it is the most efficient single-step attack.

At epsilon 8/255: a white pixel (255) can shift to [247, 263] clamped to [247, 255]. A mid-grey pixel (128) can shift to [120, 136]. The change is invisible at this scale.
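
The arithmetic above is easy to check directly. This short sketch uses a hypothetical helper, `linf_range`, that returns the interval a pixel can reach under the budget after clamping to the valid range.

```python
def linf_range(pixel, eps_units=8, lo=0, hi=255):
    """Interval reachable by one pixel under an L-infinity budget, after clamping."""
    return max(lo, pixel - eps_units), min(hi, pixel + eps_units)

for pixel in (255, 128, 0):
    print(pixel, linf_range(pixel))   # (247, 255), (120, 136), (0, 8)

print(round(8 / 255, 3))              # 0.031 on a 0-to-1 scale
```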

[Figure: pixel value range and the L-infinity ball around input x]

The sign operation is the key design choice. By taking only the sign of the gradient rather than the gradient magnitude, every input dimension contributes the maximum allowed perturbation. This makes FGSM the most efficient use of the epsilon budget in a single step, but it also means FGSM is a coarse approximation of the optimal adversarial perturbation within the epsilon constraint. PGD finds a better approximation through iteration.

Section 02

PGD: iterative attacks

Madry, Makelov, Schmidt, Tsipras, and Vladu introduced Projected Gradient Descent (PGD) in 2018 as part of their work on adversarially robust deep learning. Their central observation: FGSM takes one large gradient step and stops. This may not reach the worst-case adversarial example within the epsilon constraint. An iterative approach that takes many smaller steps and projects back onto the constraint after each step finds a stronger adversarial example.

Madry et al. also showed that training on PGD adversarial examples provides meaningful empirical robustness. This made PGD the standard attack for adversarial robustness evaluation: if a model survives PGD with large K and small alpha, it is genuinely harder to attack than a model that only survives single-step FGSM.

PGD attack step by step (Madry et al. 2018)

Step 0: Initialise
Start at the original input, or at a random point within the epsilon-ball. Random start makes PGD more likely to find the global worst case.
x_0 = x + random_uniform(-ε, +ε)

Step 1: Gradient step
Compute the loss gradient with respect to the current perturbed input. Take a step of size alpha in the gradient sign direction (same as FGSM but with smaller step size alpha instead of epsilon).
x'_1 = x_0 + α · sign( ∇_x J(θ, x_0, y) )

Step 2: Project onto epsilon-ball
After the gradient step, clip the result back into the L-infinity epsilon-ball around the original x. This keeps the perturbation within the budget. Clipping is the projection operation.
x_1 = Clip_{x,ε}(x'_1) = clip(x'_1, x - ε, x + ε)

Step K: Repeat for K iterations
Repeat gradient step and projection K times. After K steps, the result is a strong adversarial example. Larger K finds a stronger adversarial example at the cost of more compute. Typical values: K = 20 to 100.
α is typically ε/K or ε/4. Smaller α gives more precise iterations.
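
The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions: the model is a hypothetical four-feature logistic regression (made-up `w`, `b`, `x`), the step size is ε/4, and K = 20. On a linear model PGD ends at the same corner of the epsilon-ball that FGSM reaches in one step; the iteration only pays off on non-linear models, where the gradient direction changes between steps.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def input_gradient(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input: (p - y) * w."""
    p = sigmoid(score(w, b, x))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd(w, b, x, y, eps, alpha, steps, rng):
    # Step 0: random start inside the epsilon-ball around x
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(steps):
        # Step 1: signed-gradient step of size alpha
        grad = input_gradient(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Step 2: project back onto the L-infinity ball around the original x
        x_adv = [min(xi + eps, max(xi - eps, xa)) for xi, xa in zip(x, x_adv)]
        # Keep the result a valid input in [0, 1]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

# Hypothetical toy model and input
w, b = [2.0, -3.0, 1.5, -0.5], 0.0
x, y = [0.52, 0.48, 0.51, 0.50], 1
eps = 8 / 255

x_adv = pgd(w, b, x, y, eps, alpha=eps / 4, steps=20, rng=random.Random(0))
print("clean score:", round(score(w, b, x), 3))
print("adversarial score:", round(score(w, b, x_adv), 3))
```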

Attack strength comparison: FGSM vs PGD vs C&W

Attack | Steps | Perturbation | Strength | Best use
FGSM (Goodfellow 2014) | 1 | Fixed ε (L∞) | Baseline | Fast screening, data augmentation baselines
PGD (Madry 2018) | K (20 to 100) | Fixed ε (L∞) | Strong | Standard robustness evaluation and adversarial training
C&W (Carlini & Wagner 2017) | Many (optimisation) | Minimal (L₂) | Strongest | Breaking defences, finding true attack success rate

Section 03

The Carlini-Wagner attack

Nicholas Carlini and David Wagner published their attack in 2017 specifically to break defences that were being proposed at the time. FGSM and PGD fix epsilon and ask "can the model be fooled within this budget?" C&W asks a different question: "what is the smallest perturbation that fools the model?" This reversal of the objective turns the attack into a constrained optimisation problem.

C&W objective (Carlini and Wagner 2017)

minimise   ||δ||_2  +  c · f(x + δ)

||δ||_2: L2 norm of the perturbation. The objective minimises this: find the smallest possible change to the input.
c: Balancing constant. Controls the tradeoff: high c prioritises misclassification, low c prioritises small perturbation.
f(x + δ): Confidence loss: negative when the model is fooled (misclassifies x + δ), positive when correct. Drives the optimisation to find misclassification.
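
To make the two terms concrete, the sketch below evaluates the objective for a given perturbation and a given set of logits. It assumes one common form of the confidence loss, the margin between the true-class logit and the best other logit, floored at -kappa so the optimiser stops pushing once the model is fooled by that margin. The function names, the `kappa` parameter, and the example logits are illustrative assumptions, and the search over δ itself is omitted.

```python
import math

def cw_margin(logits, true_label, kappa=0.0):
    """Confidence loss f: positive while the true class still wins,
    negative (down to -kappa) once another class has the largest logit."""
    true_logit = logits[true_label]
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(true_logit - best_other, -kappa)

def cw_objective(delta, logits_adv, true_label, c, kappa=0.0):
    """||delta||_2 + c * f(x + delta), given the logits the model produces at x + delta."""
    l2 = math.sqrt(sum(d * d for d in delta))
    return l2 + c * cw_margin(logits_adv, true_label, kappa)

# Hypothetical logits for a three-class model, true class 0
print(cw_margin([3.0, 1.0, 0.0], true_label=0, kappa=5.0))   # model correct: positive
print(cw_margin([1.0, 3.0, 0.0], true_label=0, kappa=5.0))   # model fooled: negative
```

An optimiser minimising `cw_objective` is pulled in two directions: shrinking `delta` lowers the first term, while crossing the decision boundary lowers the second. The constant `c` decides which pull dominates.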

C&W finds adversarial examples with smaller L2 norms than FGSM or PGD, often making them even harder to detect by perturbation magnitude. This is why C&W broke many defences that were designed to reject inputs with high perturbation magnitude: those defences were calibrated against FGSM and PGD magnitude, and C&W produces much smaller perturbations.

For evaluating whether a safety classifier is truly robust, C&W gives a more accurate picture than FGSM alone. A classifier that survives FGSM may still be broken by C&W. A classifier that survives C&W has a stronger empirical robustness claim.

Carlini and Wagner evaluated their attack against 17 existing defences and broke all of them. Their paper is often cited as establishing that empirical defences that are not formally certified should be evaluated with the strongest available attack, not just FGSM. This insight shaped the field's understanding of what "adversarial robustness" means and directly motivated the move toward certified defences and standardised benchmarks like RobustBench.

Section 04

Physical-world adversarial attacks

The attacks in sections 01 through 03 assume the adversarial example is delivered as a digital file directly to the model. Eykholt, Evtimov, Fernandes, Li, Rahmati, Xiao, Prakash, Kohno, and Song raised the stakes in 2018 by demonstrating that adversarial perturbations can be applied to physical objects, survive the full physical pipeline (printing, placement, environmental conditions, camera capture), and still fool deep learning classifiers.

Their target: stop sign recognition in autonomous vehicle perception systems. Their method: carefully designed patches printed and physically placed on stop signs. The result: the vision system classified a stop sign as a speed limit sign at distances and angles from which a human driver would still read the sign correctly.

Physical-world attack pipeline (Eykholt et al. 2018)

Design
Digital optimisation
Compute patch optimised to fool the classifier across multiple viewing conditions.
Print
Physical production
Print the patch and apply it to the physical object. Accounts for colour rendering differences.
Field
Environmental exposure
Object is exposed to variable lighting, weather, and viewing angles. Perturbation must survive these.
Capture
Camera acquisition
Camera images the object. JPEG compression, resolution, and focus reduce perturbation fidelity.
Fool
Model misclassification
Image enters the ML model. Stop sign classified as speed limit sign at normal driving distance.

Physical-world attacks are harder to craft because the perturbation must remain effective in expectation over a range of transformations: it must work not just at one angle and distance but across the range a real sensor will observe. This requires optimising the patch against a distribution of transformations, which makes the optimisation harder but the resulting patches more robust.
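
The quantity being optimised can be estimated by sampling. The sketch below shows only that estimate, not the patch optimisation itself: it assumes a hypothetical linear classifier and a made-up capture model (`transform`) that applies random brightness gain and sensor noise, then counts how often a candidate input is misclassified across sampled views.

```python
import random

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def transform(x, rng):
    """Hypothetical capture model: random brightness gain plus sensor noise."""
    gain = rng.uniform(0.8, 1.2)
    return [min(1.0, max(0.0, gain * xi + rng.gauss(0.0, 0.02))) for xi in x]

def success_rate(w, b, x, true_label, rng, samples=500):
    """Monte Carlo estimate of how often x is misclassified across random views."""
    fooled = 0
    for _ in range(samples):
        view = transform(x, rng)
        predicted = 1 if score(w, b, view) > 0 else 0
        fooled += predicted != true_label
    return fooled / samples

# Hypothetical model, clean input, and physically modified input
w, b = [2.0, -3.0, 1.5, -0.5], 0.0
x_clean = [0.8, 0.2, 0.7, 0.3]
x_patched = [0.5, 0.6, 0.5, 0.4]

rng = random.Random(0)
clean_rate = success_rate(w, b, x_clean, true_label=1, rng=rng)
patched_rate = success_rate(w, b, x_patched, true_label=1, rng=rng)
print("fooled on clean input:", clean_rate)
print("fooled on patched input:", patched_rate)
```

A physical attack searches for the modification that maximises this rate, rather than one that fools the model on a single fixed image.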

For non-image models deployed in physical contexts, analogous attacks exist: adversarial audio that fools speech recognition, adversarial text printed on physical documents that causes document classification errors, and adversarial lighting patterns that confuse visual inspection systems.

Section 05

Adversarial examples in NLP and LLMs

Text is discrete: you cannot add a small continuous perturbation to a word the way you can to a pixel. NLP adversarial example research found different strategies for making imperceptible (to humans) changes to text that change model predictions. Three attack levels have been established, each requiring different detection methods.

Character-level attacks
HotFlip · Ebrahimi et al. 2018
Swap, insert, or delete individual characters in the text. The resulting text is visually similar to the original and often readable by humans, but the character-level change causes the model to produce a different prediction. HotFlip uses the gradient of the model's loss with respect to the one-hot encoding of input characters to find which character swap maximises the loss (same gradient insight as FGSM, applied to character space).
Original: "This movie was great, highly recommended."
Adversarial: "This movie was gr3at, highly recommended." → Model predicts: Negative
A human reads the same sentiment. The model misclassifies because the altered word becomes an out-of-vocabulary token.

Word-level attacks (synonym substitution)
Alzantot et al. 2018 · Genetic algorithm
Replace words with synonyms selected to both preserve the original meaning (evaluated by a semantic similarity model) and change the classifier's prediction. Alzantot et al. used a genetic algorithm to search for synonyms that satisfy both constraints. The resulting text is grammatically correct, semantically equivalent to a human reader, and classified differently by the target model.
Original: "The acting was brilliant and the plot was engaging." → Positive
Adversarial: "The acting was superb and the plot was absorbing." → Negative
Synonyms preserve meaning for humans. The classifier's word-embedding decision boundary differs for these synonyms.

Universal adversarial triggers
Wallace et al. 2019 · Input-agnostic
Wallace et al. found short token sequences that, when prepended to any input, cause the model to produce a target output regardless of what the input contains. Unlike character or word attacks, which are input-specific, universal triggers work for the entire input space. The trigger is found by gradient-based optimisation over the vocabulary: find the token sequence that most consistently produces the target output across a batch of inputs.
Trigger tokens prepended to any input:
"zoning tunes brightly [any review text]" → Model always predicts Positive
The trigger phrase has no human-readable meaning. It exploits statistical patterns in the embedding space. It works across inputs the trigger was not optimised for (transfers across the input distribution).
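
The idea of a universal trigger can be shown on a toy model. The sketch below is an illustration only: it assumes a hypothetical bag-of-words sentiment classifier with made-up token weights, and it replaces the gradient-guided search of Wallace et al. with a greedy exhaustive search over a tiny vocabulary. At each position it picks the token that flips the most inputs in the batch to Positive.

```python
# Hypothetical token weights of a toy bag-of-words sentiment classifier
WEIGHTS = {
    "great": 2.0, "brilliant": 1.8, "superb": 1.5,
    "awful": -2.0, "boring": -1.5, "dull": -1.2,
    "zoning": 2.5, "tunes": 0.4, "brightly": 0.9,
}

def score(tokens):
    return sum(WEIGHTS.get(token, 0.0) for token in tokens)

def predict(tokens):
    return "Positive" if score(tokens) > 0 else "Negative"

def find_trigger(inputs, vocab, length):
    """Greedy search: extend the trigger one token at a time, choosing the token
    that makes the most inputs Positive (ties broken by total score)."""
    trigger = []
    for _ in range(length):
        def quality(token):
            candidates = [trigger + [token] + tokens for tokens in inputs]
            flipped = sum(predict(c) == "Positive" for c in candidates)
            return flipped, sum(score(c) for c in candidates)
        trigger.append(max(vocab, key=quality))
    return trigger

reviews = [
    "this movie was awful".split(),
    "boring and dull".split(),
    "awful boring plot".split(),
]
trigger = find_trigger(reviews, sorted(WEIGHTS), length=2)
print("trigger:", trigger)
for review in reviews:
    print(predict(review), "->", predict(trigger + review))
```

The same trigger flips every review in the batch, which is the property that makes universal triggers reusable across payloads.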

Section 06

Targeting agent safety classifiers

AI agents use classifiers at multiple points in their processing pipeline: to detect prompt injection in user messages, flag harmful content in outputs, identify PII, and detect jailbreak attempts. These classifiers are neural networks. They have adversarial blind spots just like any other neural network.

An attacker who can craft adversarial text inputs that bypass an injection detection classifier can deliver an injection payload that the classifier labels as safe. The payload then enters the agent's context and executes as intended.

Two adversarial attack paths against agent safety classifiers

Path 1: Direct white-box attack
1. Attacker has access to the classifier model (open-source or extractable)
2. Use HotFlip or a word-level attack to find adversarial text that bypasses the classifier
3. Classifier labels adversarial injection as safe. Payload executes.
Requires classifier access. Highly targeted adversarial example.

Path 2: Transfer black-box attack
1. Attacker uses any publicly available content moderation model as a substitute
2. Craft adversarial examples that fool the substitute classifier
3. Transfer those examples to the production system. Transfer rate: 40 to 80%.
No production access needed. Works at scale. Most practical real-world path.

This is why adversarial robustness of safety classifiers matters at training time, not only at deployment. The defence must be built into the classifier through adversarial training, not bolted on as a post-hoc filter.

Universal adversarial triggers against agent classifiers. Wallace et al. 2019 showed that universal triggers work against NLP classifiers. An attacker who finds a universal trigger for a content moderation classifier used by a widely deployed agent can publish that trigger and anyone can use it to bypass that classifier without crafting individual adversarial examples. Universal triggers are particularly dangerous because they scale: one optimisation run produces a trigger that works for any payload an attacker wants to deliver.

Section 07

Adversarial transferability

Goodfellow et al. 2014 observed in their original paper that adversarial examples crafted for one model often fooled other models. This was unexpected: the models had different weights and had been trained independently. Subsequent research quantified this transfer rate and proposed hypotheses for why it happens.

Transfer attack flow: crafted on substitute, delivered to target

Attacker has access
Substitute model
Any model trained on same task. Could be open-source or locally trained.
Craft
Adversarial example
Use FGSM, PGD, or C&W against the substitute. Full white-box access.
Deliver to
Target model
Production API. Different architecture. Attacker cannot access weights.
Result
Misclassification
Transfer succeeds 40-80% of the time depending on model similarity.

40-55%: Transfer across very different architectures (e.g. ResNet to VGG)
55-70%: Transfer within same architecture family (e.g. ResNet-50 to ResNet-101)
70-80%+: Transfer from ensemble of substitutes targeting one model

Why does transfer happen? The leading hypothesis from Goodfellow et al. 2014 is that adversarial examples exploit linear structure in the decision boundary that is shared across models. Different models trained on the same data learn similar linear decision boundaries, and perturbations that cross one model's boundary tend to cross others' boundaries too. Models trained on the same task in the same domain face the same statistical distribution of inputs, and their learned representations share common structure even when the architecture differs.
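
The shared-structure argument can be illustrated with two independently trained models. The sketch below is a toy under stated assumptions: both models are logistic regressions trained on different samples from the same synthetic distribution (two Gaussian classes, made-up parameters), adversarial examples are crafted with FGSM against the substitute only, and the perturbation budget is deliberately large so that attacks on this easy task succeed. It does not reproduce the transfer rates quoted above, which concern deep networks.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(model, x):
    return 1 if score(model, x) > 0 else 0

def make_data(rng, n, dim=4):
    """Synthetic task: class 1 centred at 0.65, class 0 at 0.35, in every feature."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 0.65 if y == 1 else 0.35
        data.append(([rng.gauss(centre, 0.1) for _ in range(dim)], y))
    return data

def train(data, epochs=50, lr=0.5):
    """Plain stochastic gradient descent on the cross-entropy loss."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(score((w, b), x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def fgsm(model, x, y, eps):
    """Signed-gradient step; the input gradient of the loss is err * w."""
    w, _ = model
    err = sigmoid(score(model, x)) - y
    return [xi + eps * ((err * wi > 0) - (err * wi < 0)) for xi, wi in zip(x, w)]

rng = random.Random(0)
substitute = train(make_data(rng, 200))   # attacker's local model
target = train(make_data(rng, 200))       # stands in for the production model

crafted = transferred = 0
for x, y in make_data(rng, 300):
    if predict(substitute, x) != y or predict(target, x) != y:
        continue  # only count inputs both models classify correctly
    x_adv = fgsm(substitute, x, y, eps=0.2)
    if predict(substitute, x_adv) != y:
        crafted += 1
        transferred += predict(target, x_adv) != y

transfer_rate = transferred / crafted if crafted else 0.0
print(f"{transferred}/{crafted} adversarial examples transferred ({transfer_rate:.0%})")
```

The attacker never queries `target` while crafting. Any transfer comes from the two models having learned similar decision boundaries from the same data distribution.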

For NLP, the transfer rates for universal adversarial triggers are particularly high because the triggers exploit statistical patterns in the language distribution rather than model-specific geometric structure. Any model trained on similar text data will exhibit similar statistical regularities that the trigger exploits.

Section 08

Defences: adversarial training and randomised smoothing

Adversarial robustness defences divide into two categories: empirical defences, which reduce attack success rates in practice without formal guarantees, and certified defences, which provide mathematical guarantees. Adversarial training is the leading empirical defence. Randomised smoothing is the leading certified defence. Section 09 covers what these guarantees mean and how to evaluate them.

Adversarial training (Madry et al. 2018): the min-max objective

Batch
Sample mini-batch
Sample a mini-batch of clean training examples as in standard training.
Attack
Generate adversarial examples
Run PGD on each example to generate the strongest adversarial example within epsilon. This is the inner maximisation.
Train
Train on adversarial batch
Update model weights to correctly classify both clean and adversarial examples. This is the outer minimisation.
Result
Robust model
Model learns decision boundary that is correct across the epsilon neighbourhood of every training point.

What improves: +35 to +45 pp. Robust accuracy against PGD attack. Model resists adversarial perturbations within epsilon.
What is sacrificed: -5 to -15 pp. Clean accuracy drops. The model is more conservative across the epsilon neighbourhood, reducing discrimination on easy inputs.
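
The loop above can be sketched on a toy model. This is a minimal illustration, not the Madry et al. implementation: it assumes a logistic regression on synthetic data (made-up parameters), and it uses a single signed-gradient step for the inner maximisation, which on a linear model already reaches the worst case inside the L-infinity ball and so stands in for PGD. Each update trains on both the clean and the adversarial version of the example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def worst_case(w, b, x, y, eps):
    """Inner maximisation: signed-gradient step of size eps on the input."""
    err = sigmoid(score(w, b, x)) - y
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]

def train(data, eps, epochs=50, lr=0.5):
    """Outer minimisation. With eps == 0 this is standard training."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            batch = [x]
            if eps > 0:
                batch.append(worst_case(w, b, x, y, eps))
            for x_in in batch:
                err = sigmoid(score(w, b, x_in)) - y
                w = [wi - lr * err * xi for wi, xi in zip(w, x_in)]
                b -= lr * err
    return w, b

def accuracy(model, data, eps=0.0):
    """Clean accuracy when eps == 0, worst-case (robust) accuracy otherwise."""
    w, b = model
    correct = 0
    for x, y in data:
        x_eval = worst_case(w, b, x, y, eps) if eps > 0 else x
        correct += (score(w, b, x_eval) > 0) == (y == 1)
    return correct / len(data)

def make_data(rng, n, dim=4):
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        centre = 0.65 if y == 1 else 0.35
        data.append(([rng.gauss(centre, 0.1) for _ in range(dim)], y))
    return data

rng = random.Random(0)
train_data, test_data = make_data(rng, 200), make_data(rng, 200)
eps = 0.1

standard = train(train_data, eps=0.0)
robust = train(train_data, eps=eps)
for name, model in (("standard", standard), ("adversarial", robust)):
    print(name, "clean:", accuracy(model, test_data),
          "robust:", accuracy(model, test_data, eps))
```

On a task this simple the two models may end up close to each other. The clean-versus-robust tradeoff described above shows up on high-dimensional data and deep networks, not on this toy.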
Randomised smoothing: from empirical to certified (Cohen, Rosenfeld, Kolter 2019)

How randomised smoothing constructs a certified classifier

Input
x
Original clean input
Add noise
x + N(0, σ²I)
Gaussian noise added N times. N = 100 to 1000 samples.
Base classifier
f(x + noise)
Any classifier f. Run on each noisy copy.
Vote
g(x)
Predict class with most votes across noisy copies. Certified classifier.
Certified radius guarantee
r = σ · Φ⁻¹(p_A)
Here Φ⁻¹ is the inverse of the standard normal cumulative distribution function, and p_A is the probability that the base classifier returns the top class A for x under noise (estimated from the N samples). If p_A is above 1/2 and the L2 norm of the perturbation is less than r, the smooth classifier g is guaranteed to keep predicting class A. When A is the correct class, no attack within this radius can cause misclassification. This is a mathematical guarantee, not an empirical one.
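
The vote and the radius can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the base classifier is a hypothetical linear threshold, and because p_A is only estimated from samples, the sketch certifies with a lower confidence bound on it. A simple Hoeffding bound is used here as a stand-in for the tighter binomial bound a real implementation would use. The parameter `alpha` is the allowed failure probability of that bound.

```python
import math
import random
from statistics import NormalDist

def base_classifier(x):
    """Hypothetical base classifier f: a fixed linear threshold on four features."""
    return 1 if sum(x) > 2.0 else 0

def smoothed_predict(x, sigma, n, rng):
    """Majority vote of the base classifier over n Gaussian-noised copies of x."""
    votes = [0, 0]
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    top = votes.index(max(votes))
    return top, votes[top] / n

def certified_radius(p_hat, sigma, n, alpha=0.001):
    """r = sigma * inverse_normal_cdf(lower bound on p_A); 0.0 means abstain."""
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

rng = random.Random(0)
x = [0.7, 0.7, 0.7, 0.7]
sigma, n = 0.1, 1000

label, p_hat = smoothed_predict(x, sigma, n, rng)
radius = certified_radius(p_hat, sigma, n)
print("predicted class:", label)
print("certified L2 radius:", round(radius, 3))
```

For this linear base classifier the true L2 distance from `x` to the decision boundary is 0.4, so a valid certificate must stay below that value. Using the raw estimate of p_A instead of a lower bound can overstate the radius when almost every vote agrees.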

Section 09

Certified robustness, empirical robustness, and RobustBench

The field uses "adversarial robustness" to mean two different things. Understanding the distinction is essential for evaluating robustness claims made by vendors, researchers, or your own red teaming.

Certified robustness: mathematical guarantee
Provides a mathematical proof that no attack within a defined perturbation radius can cause misclassification.
Does not depend on which attacks were tried. Holds against all possible attacks within the guarantee.
Typically uses randomised smoothing or interval bound propagation to compute the certificate.
Achieves lower clean and robust accuracy than empirical methods at the same epsilon.
Certified radius may be smaller than the epsilon values used in real attacks.

Empirical robustness: measured against specific attacks
Higher clean and robust accuracy than certified methods at the same epsilon.
Adversarial training is practical for large models including LLMs.
Only shows the model survived the specific attacks that were tried. A new attack may break it.
Carlini and Wagner broke 17 defences that were "empirically robust" against FGSM and PGD.
RobustBench uses AutoAttack to reduce the risk of overestimating empirical robustness.

RobustBench state of the art (Croce et al. 2021): CIFAR-10 at L-inf epsilon 8/255

Defence approach | Clean accuracy | Robust accuracy (AutoAttack) | Type
Standard training (no adversarial augmentation) | 95% | 0% | Baseline
Adversarial training, PGD (Madry et al. method) | 82-84% | 43-47% | Empirical
State-of-the-art adversarial training (best models on RobustBench leaderboard) | 73-75% | 48-53% | Best empirical
Randomised smoothing (Cohen et al. certified method) | 65-70% | 36-42% | Certified

What RobustBench's numbers tell you. The drop in clean accuracy relative to standard training is the cost of robustness. For the adversarially trained models in the table at epsilon 8/255, you trade roughly 10 to 20 percentage points of clean accuracy for 40 to 50 percentage points of robust accuracy. Whether this tradeoff is acceptable depends on the threat model. For a safety classifier in an AI agent, robust accuracy against adversarial bypasses may be more important than maximising clean accuracy. For a recommendation model, the tradeoff may not be worth it. Knowing which epsilon and which norm you care about is a prerequisite for using these numbers to make real decisions.
