Question 1

How does FGSM generate adversarial examples?

Accepted Answer

FGSM (Fast Gradient Sign Method), introduced by Goodfellow, Shlens, and Szegedy in 2014, generates an adversarial example in a single step. Compute the gradient of the model's loss with respect to the input pixels (not the weights). Take the sign of that gradient, which gives +1 or -1 for each pixel (0 where the gradient is exactly zero). Multiply by epsilon, the perturbation budget, and add the result to the original input: x_adv = x + epsilon * sign(grad_x L(x, y)). Each pixel is shifted by at most epsilon in the direction that increases the loss, so the input is pushed away from the region the model assigns to the correct class, which often, though not always, causes a misclassification. Because the sign operation discards gradient magnitude, every pixel with a non-zero gradient moves by exactly epsilon, so the perturbation uses the full budget in every dimension. This is why FGSM is naturally an L-infinity attack: the largest per-pixel change is exactly epsilon.
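The single step described above can be sketched on a logistic regression model, where the input gradient has a closed form. This is a minimal illustration, not a reference implementation: the weights, the 3-pixel input, and the helper names (`predict_proba`, `input_gradient`, `fgsm`) are assumptions made up for the example, and a real attack would obtain the gradient from an autodiff framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this has the closed form (p - y) * w.
    """
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps, clipped to the valid range [0, 1]."""
    grad = input_gradient(w, b, x, y)
    stepped = [xi + eps * sign(g) for xi, g in zip(x, grad)]
    return [min(1.0, max(0.0, v)) for v in stepped]

# Hypothetical 3-pixel "image" and model weights, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.2, 0.5], 1
x_adv = fgsm(w, b, x, y, eps=0.2)

print(predict_proba(w, b, x) > 0.5)      # clean input: classified as class 1
print(predict_proba(w, b, x_adv) > 0.5)  # perturbed input: no longer class 1
```

Every pixel moves by exactly 0.2 here, and the logit drops from 1.1 to about -0.1, which is enough to flip the prediction for this particular input.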

Question 2

Why is PGD stronger than FGSM?

Accepted Answer

PGD (Projected Gradient Descent), from Madry et al. 2018, is an iterative version of FGSM. Instead of taking one large step of size epsilon, PGD takes K smaller signed-gradient steps of size alpha and projects the result back onto the epsilon-ball around the original input after each step, usually starting from a random point inside the ball. Because the gradient is recomputed at every step, PGD can follow the curvature of the loss surface and typically finds a perturbation much closer to the worst case within the epsilon constraint than FGSM's single linear step does. PGD is still a first-order heuristic: it finds a local maximum of the loss, not a guaranteed global one, which is why evaluations commonly use many iterations and several random restarts. This is why PGD is used as a standard evaluation attack: a model that withstands PGD offers stronger evidence of robustness than one that only survives FGSM.
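The step-then-project loop can be sketched as follows. To keep the example self-contained, `toy_loss` and `toy_loss_grad` are a hypothetical non-linear loss surface standing in for a real model's loss, the random start is omitted so the run is deterministic, and clipping to a valid pixel range is left out.

```python
def sign(v):
    return (v > 0) - (v < 0)

def toy_loss(x):
    """Hypothetical non-linear loss surface with its peak at (0.7, 0.1)."""
    return -((x[0] - 0.7) ** 2) - ((x[1] - 0.1) ** 2)

def toy_loss_grad(x):
    """Analytic gradient of toy_loss with respect to x."""
    return [-2.0 * (x[0] - 0.7), -2.0 * (x[1] - 0.1)]

def project_linf(x_adv, x, eps):
    """Clip each coordinate back into [x_i - eps, x_i + eps]."""
    return [min(xi + eps, max(xi - eps, v)) for v, xi in zip(x_adv, x)]

def pgd(x, grad_fn, eps, alpha, steps):
    """Repeated signed-gradient ascent steps, each followed by projection."""
    x_adv = list(x)  # no random start, to keep the example deterministic
    for _ in range(steps):
        grad = grad_fn(x_adv)
        x_adv = [v + alpha * sign(g) for v, g in zip(x_adv, grad)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv

x = [0.5, 0.5]
eps = 0.3
# FGSM is the special case of one step with alpha equal to eps.
x_fgsm = pgd(x, toy_loss_grad, eps, alpha=eps, steps=1)
x_pgd = pgd(x, toy_loss_grad, eps, alpha=0.05, steps=20)

print(toy_loss(x_fgsm), toy_loss(x_pgd))
```

On this surface the single large step overshoots the peak in the first coordinate, while the smaller steps settle near it, so the iterative attack ends at a higher loss inside the same epsilon-ball.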

Question 3

What makes the Carlini-Wagner attack different from FGSM and PGD?

Accepted Answer

FGSM and PGD fix the perturbation budget epsilon and try to achieve misclassification within that budget. The Carlini-Wagner (C&W) attack, from Carlini and Wagner 2017, reverses the objective: it minimises the perturbation size subject to achieving misclassification. It treats the attack as an optimisation problem with two terms: the L2 norm of the perturbation, and a margin loss on the model's logits that is positive while the attack has not yet succeeded and falls to zero or below once the input is misclassified. Both terms are minimised together, and a constant c balances them; c is usually chosen by binary search as the smallest value that still yields a successful attack. Because C&W searches for a near-minimal perturbation for each input instead of spending a fixed budget, the perturbations it finds are often far smaller than what a fixed-epsilon attack would apply. This makes C&W effective against defences that rely on perturbation magnitude thresholds.
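The two-term objective can be written down directly. The sketch below shows an untargeted form of the margin term (the original paper's formulation is targeted and optimises through a tanh change of variables, both omitted here); the function names, the logits, and the value of `c` are assumptions for illustration, and the optimiser that would minimise the objective is not shown.

```python
def cw_margin(logits, true_label, kappa=0.0):
    """Margin term: positive while the true class still has the largest logit.

    It is floored at -kappa, so kappa sets how confidently the input
    must be misclassified before the term stops decreasing.
    """
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -kappa)

def cw_objective(delta, logits_adv, true_label, c, kappa=0.0):
    """Squared L2 size of the perturbation plus c times the margin term."""
    l2_squared = sum(d * d for d in delta)
    return l2_squared + c * cw_margin(logits_adv, true_label, kappa)

# Hypothetical logits for a 3-class model, before and after a successful attack.
still_correct = cw_margin([3.0, 1.0, 0.5], true_label=0)  # true class wins by 2.0
misclassified = cw_margin([1.0, 3.0, 0.5], true_label=0)  # floored at -kappa = 0

print(still_correct, misclassified)
print(cw_objective([0.1, -0.2], [3.0, 1.0, 0.5], true_label=0, c=1.0))
```

Once the margin term reaches its floor, only the size term remains, so further optimisation shrinks the perturbation instead of pushing the input deeper into the wrong class.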

Question 4

How do physical-world adversarial attacks work?

Accepted Answer

Physical-world adversarial attacks, demonstrated by Eykholt et al. in 2018, apply adversarial perturbations to physical objects that are then photographed by camera-based ML systems. The challenge is that the perturbation must survive the full physical pipeline: printing, placement under varying lighting conditions, viewpoint changes, and camera capture. Eykholt et al. showed that carefully designed stickers placed on stop signs caused a road sign classifier, of the kind used in autonomous vehicle perception, to misclassify the stop sign as a speed limit sign across different distances, lighting conditions, and viewpoints. To a human observer the stickers look like graffiti or ordinary stickers, but they are optimised over many varied conditions so that the misclassification holds across those conditions and not just for a single image.

Question 5

What are NLP adversarial examples?

Accepted Answer

NLP adversarial examples modify text inputs so that NLP models produce wrong outputs while the meaning of the text is preserved for human readers. They are usually grouped into three levels. Character-level attacks (HotFlip, Ebrahimi et al. 2018) flip individual characters, chosen using gradient information, to change model predictions with only small edits to the text. Word-level attacks (Alzantot et al. 2018) substitute words with synonyms, using a genetic algorithm to find substitutions that fool the model while preserving sentence meaning as evaluated by a second model. Sentence-level attacks generate paraphrases that mean the same thing but cause different model outputs. Universal adversarial triggers (Wallace et al. 2019) are specific sequences of tokens that, when prepended to an input, push the model toward a target output largely regardless of the actual input content.
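A character-level attack can be illustrated with a greedy substitution loop. This is not HotFlip, which ranks candidate flips using gradients; it is a simpler black-box heuristic that keeps an edit only when the score drops. The `toy_score` function is a hypothetical stand-in for a model, and the `LOOKALIKES` table and blocked words are assumptions for the example.

```python
LOOKALIKES = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}
BLOCKED = ("attack", "exploit")

def toy_score(text):
    """Hypothetical stand-in for a model: counts blocked words in the text."""
    lowered = text.lower()
    return sum(word in lowered for word in BLOCKED)

def char_level_attack(text, score_fn, max_edits=3):
    """Greedily swap one character at a time, keeping edits that lower the score."""
    chars = list(text)
    edits = 0
    for pos, ch in enumerate(text):
        current = score_fn("".join(chars))
        if current == 0 or edits >= max_edits:
            break
        substitute = LOOKALIKES.get(ch.lower())
        if substitute is None:
            continue
        candidate = chars[:pos] + [substitute] + chars[pos + 1:]
        # Keep the edit only if it actually reduces the model's score.
        if score_fn("".join(candidate)) < current:
            chars = candidate
            edits += 1
    return "".join(chars)

adversarial = char_level_attack("how to exploit this attack", toy_score)
print(adversarial)  # how to 3xploit this @ttack
```

Two single-character edits are enough to drive the toy score to zero while a human reader still recovers the original wording, which is the property the character-level attacks above rely on.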

Question 6

How can adversarial examples target AI agent safety classifiers?

Accepted Answer

AI agents use safety classifiers to detect prompt injection, harmful content, PII, and other policy violations. These classifiers are themselves neural networks, which means they have adversarial blind spots. An attacker who can craft inputs that fool the safety classifier can cause harmful content to pass through the guardrail layer undetected. The transfer property is critical here: the attacker does not need access to the exact classifier used by AgentIQ or any other safety system. They can craft adversarial examples against a locally accessible substitute classifier (any publicly available content moderation model) and those examples will transfer to the production safety classifier with meaningful probability. This is why adversarial robustness of safety classifiers is a training concern, not only an inference concern.

Question 7

Why do adversarial examples transfer between models?

Accepted Answer

Adversarial transferability is the observation that adversarial examples crafted for one model often fool other models trained on the same task, even when the architectures differ. Goodfellow et al. 2014 proposed that this happens because adversarial examples exploit the largely linear behaviour of neural networks, which is shared across models. Different models trained on the same data learn similar decision boundaries, so a perturbation that pushes an input across one model's boundary tends to push it across other models' boundaries too. Empirically, transfer rates of 40 to 80 percent have been reported across different image-classifier architectures, with the rate depending on the attack, the perturbation budget, and how similar the models are. For NLP, universal adversarial triggers have also been reported to transfer well, plausibly because they exploit statistical patterns in the language data rather than model-specific geometry.
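Transfer can be measured by crafting perturbations against one model and counting how many also fool a second model. The sketch below uses two hypothetical linear classifiers with similar but not identical weights; the weights, data points, and function names are all assumptions chosen so the arithmetic is easy to follow, not measurements from any real system.

```python
def sign(v):
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Linear classifier returning +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def craft_on_surrogate(w_s, x, y, eps):
    """Signed step that lowers the surrogate's score for the true label y."""
    return [xi - eps * y * sign(wi) for xi, wi in zip(x, w_s)]

def transfer_rate(surrogate, target, data, eps):
    """Fraction of examples fooling the surrogate that also fool the target."""
    (w_s, b_s), (w_t, b_t) = surrogate, target
    fooled_surrogate = fooled_both = 0
    for x, y in data:
        x_adv = craft_on_surrogate(w_s, x, y, eps)
        if predict(w_s, b_s, x_adv) != y:
            fooled_surrogate += 1
            fooled_both += int(predict(w_t, b_t, x_adv) != y)
    return fooled_both / fooled_surrogate if fooled_surrogate else 0.0

# Hypothetical models: the target never sees the attacker's gradient.
surrogate = ([1.0, -1.0, 0.5], 0.0)
target = ([0.8, -1.2, -0.1], 0.1)
data = [
    ([0.5, 0.2, 0.1], 1),
    ([0.9, 0.1, 0.0], 1),
    ([0.1, 0.4, 0.2], -1),
    ([0.2, 0.6, 0.0], -1),
]
rate = transfer_rate(surrogate, target, data, eps=0.2)
print(rate)
```

Three of the four perturbed points fool the surrogate and two of those also fool the target, so the transfer rate on this toy data is two thirds. The one failure comes from the point where the two models' boundaries disagree most.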

Question 8

What is adversarial training and what is its tradeoff?

Accepted Answer

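Adversarial training, formalised by Madry et al. in 2018, augments the training process with adversarially perturbed examples. At each training step, generate a PGD adversarial example for every input in the batch using the current weights, then update the model so that it predicts the correct label on the adversarial version (Madry et al. train on the adversarial examples alone; a common variant trains on both the original and the adversarial version). This is a min-max problem: the inner attack maximises the loss within the epsilon ball, and the outer training step minimises that worst-case loss. The model learns a decision boundary that is robust within the epsilon ball. The main tradeoff is lower clean accuracy than standard training: the model must keep its prediction constant across each input's epsilon neighbourhood, which trades some discriminative capacity for robustness. Reported clean accuracy drops range from 5 to 15 percentage points on standard benchmarks, while robust accuracy against PGD rises from near zero to a substantial fraction of inputs. Training is also several times more expensive, because every step runs a multi-step attack.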
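The training loop can be sketched on logistic regression. Two simplifications are assumed here and should not be read as the Madry et al. recipe: the model is linear, so a single signed-gradient step already solves the inner maximisation and is used in place of multi-step PGD, and each update trains on both the clean and the perturbed input. The dataset and hyperparameters are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def worst_case(w, b, x, y, eps):
    """Inner maximisation: signed step along the input gradient (p - y) * w."""
    p = predict_proba(w, b, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def adversarial_train(data, eps, lr=0.5, epochs=200):
    """Full-batch gradient descent on clean plus adversarial inputs."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in data:
            # The adversarial input is regenerated from the current weights.
            for x_in in (x, worst_case(w, b, x, y, eps)):
                err = predict_proba(w, b, x_in) - y
                grad_w = [g + err * xi for g, xi in zip(grad_w, x_in)]
                grad_b += err
        n = 2 * len(data)
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def robust_accuracy(w, b, data, eps):
    """Accuracy on worst-case perturbed inputs within the eps ball."""
    hits = 0
    for x, y in data:
        x_adv = worst_case(w, b, x, y, eps)
        hits += int((predict_proba(w, b, x_adv) > 0.5) == (y == 1))
    return hits / len(data)

# Hypothetical two-feature dataset with labels in {0, 1}.
data = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
w, b = adversarial_train(data, eps=0.1)
print(robust_accuracy(w, b, data, eps=0.1))
```

The key structural point is that the attack sits inside the training loop, so the adversarial examples track the model as it changes instead of being generated once up front.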

Question 9

How does randomised smoothing provide a certified robustness guarantee?

Accepted Answer

Randomised smoothing, from Cohen, Rosenfeld, and Kolter 2019, constructs a certified classifier g from any base classifier f. Given an input x, g predicts the class that f is most likely to output when Gaussian noise with standard deviation sigma is added to x: g(x) = argmax over classes c of P(f(x + noise) = c). The key result is the certified radius. Let p_A be a lower bound on the probability that f returns the top class under noise, with p_A greater than 1/2. Then g is guaranteed to return that same class for every input within L2 distance r = sigma * Phi^-1(p_A) of x, where Phi^-1 is the inverse of the standard normal CDF. This is a mathematical guarantee, not an empirical one: no attack within the certified radius can change g's prediction, so if g is correct at x it stays correct throughout the ball. In practice p_A is estimated by Monte Carlo sampling, so the certificate holds with high probability rather than with certainty. The tradeoff is reduced clean accuracy, because the noise hurts the base classifier's performance.
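The prediction rule and the radius formula can be sketched with the standard library. Two caveats are assumptions of this sketch: `smoothed_predict` returns the raw empirical frequency, whereas Cohen et al. certify with a Clopper-Pearson lower confidence bound and abstain when it is too low, and the base classifier `f` is a hypothetical one-dimensional threshold rule.

```python
import random
from statistics import NormalDist

def smoothed_predict(base_classifier, x, sigma, n_samples, rng):
    """Majority vote of the base classifier over Gaussian-noised copies of x."""
    counts = {}
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = base_classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    top = max(counts, key=counts.get)
    return top, counts[top] / n_samples

def certified_radius(p_a, sigma):
    """L2 radius sigma * Phi^-1(p_a); zero when p_a does not exceed 1/2.

    p_a must be strictly below 1: inv_cdf raises StatisticsError otherwise,
    which is one reason a lower confidence bound is used in practice.
    """
    if p_a <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a)

def f(x):
    """Hypothetical base classifier: class 1 if the first feature is positive."""
    return 1 if x[0] > 0 else 0

label, frequency = smoothed_predict(f, [1.0], sigma=0.25, n_samples=1000,
                                    rng=random.Random(0))
radius = certified_radius(0.9, sigma=0.5)
print(label, frequency)
print(round(radius, 4))
```

With p_A = 0.9 and sigma = 0.5 the radius is about 0.64, since Phi^-1(0.9) is roughly 1.28. A larger sigma increases the radius for a given p_A but also makes it harder for the base classifier to keep p_A high, which is the accuracy tradeoff described above.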

Question 10

What is the difference between certified and empirical robustness?

Accepted Answer

Empirical robustness is measured by running specific attacks (FGSM, PGD, C&W, AutoAttack) against a model and recording the accuracy. A model is empirically robust against attack A if its accuracy under A stays above a chosen threshold. But empirical robustness only shows that the model survived the attacks that were tried; a new or stronger attack might still break it. Certified robustness is a mathematical guarantee: for a given input x, the model's prediction is provably unchanged for every perturbation within a stated budget, regardless of which attack is applied. Certified robustness is the stronger claim but is harder to achieve at scale, and current state-of-the-art certified methods reach much lower accuracy than empirically robust methods at comparable budgets. RobustBench standardises empirical robustness measurement using AutoAttack.
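The distinction is easiest to see on a linear classifier, one of the few cases where an exact certificate is trivial to compute. For score w.x + b, no L-infinity perturbation smaller than |w.x + b| / ||w||_1 can change the sign of the score. The sketch below compares that certificate with an empirical check at two budgets; the weights and input are hypothetical, and deep networks need far more elaborate certification methods than this.

```python
def sign(v):
    return (v > 0) - (v < 0)

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def certified_linf_radius(w, b, x):
    """Exact L-infinity radius inside which a linear classifier cannot flip."""
    return abs(score(w, b, x)) / sum(abs(wi) for wi in w)

def attack_flips(w, b, x, eps):
    """Empirical check: does the signed worst-case step of size eps flip the label?"""
    original = score(w, b, x)
    direction = -sign(original)  # move the score toward the decision boundary
    x_adv = [xi + eps * direction * sign(wi) for xi, wi in zip(x, w)]
    return (score(w, b, x_adv) > 0) != (original > 0)

# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x = [0.6, 0.2, 0.5]

radius = certified_linf_radius(w, b, x)
print(round(radius, 4))
print(attack_flips(w, b, x, eps=0.15))  # inside the certified radius
print(attack_flips(w, b, x, eps=0.20))  # outside the certified radius
```

The empirical check at 0.15 only shows that one particular attack failed; the certificate shows that every perturbation below roughly 0.183 must fail, which is the stronger kind of statement.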

Question 11

What does RobustBench measure and what do its numbers mean?

Accepted Answer

RobustBench, introduced by Croce et al. in 2021, is a standardised benchmark for adversarial robustness that uses AutoAttack as the evaluation attack. AutoAttack is a parameter-free ensemble of diverse attacks that is harder to defeat than vanilla PGD, which reduces the risk of overestimating robustness. RobustBench reports two numbers: clean accuracy (accuracy on the unperturbed test set, the same as standard evaluation) and robust accuracy (accuracy under AutoAttack at a specified epsilon). For CIFAR-10 at L-infinity epsilon 8/255, the figures quoted here for state-of-the-art adversarially trained models are roughly 73-75 percent clean accuracy and 43-50 percent robust accuracy. These are a snapshot: the leaderboard changes as new entries are added, and entries trained with additional or synthetic data report higher numbers on both axes. The gap between clean and robust accuracy is the robustness cost. Models that claim high clean accuracy together with high robust accuracy are rare and often rely on some form of certified or ensemble method.

Question 12

What is the epsilon ball and the L-infinity norm in adversarial examples?

Accepted Answer

The epsilon ball defines the set of allowed perturbations for an adversarial example. The L-infinity norm measures the maximum absolute change across all input dimensions (pixels, for images). An adversarial example is within the L-infinity epsilon ball if every pixel changes by at most epsilon from the original. At epsilon 8/255 (a common benchmark value), each pixel value on a 0-255 scale can shift by at most 8 units. This is typically invisible to human observers. The L2 norm instead measures the total Euclidean distance of the perturbation, allowing large changes to some pixels if the overall magnitude is bounded. The C&W attack primarily minimises the L2 norm. Different norm choices lead to different attack geometries and different practical notions of imperceptibility.
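The two norms, and the projection that keeps a perturbed input inside the L-infinity ball, can be computed directly. The example perturbations below are hypothetical 4-pixel vectors on a 0-1 scale, chosen to show how a dense small change and a sparse large change are ranked differently by the two norms.

```python
import math

def linf_norm(delta):
    """Largest absolute change in any single dimension."""
    return max(abs(d) for d in delta)

def l2_norm(delta):
    """Euclidean length of the perturbation."""
    return math.sqrt(sum(d * d for d in delta))

def project_linf(x_adv, x, eps):
    """Clip each coordinate back into [x_i - eps, x_i + eps]."""
    return [min(xi + eps, max(xi - eps, v)) for v, xi in zip(x_adv, x)]

eps = 8 / 255  # about 0.0314 on a 0-1 pixel scale

dense = [eps, -eps, eps, -eps]   # every pixel nudged by the full budget
sparse = [0.5, 0.0, 0.0, 0.0]    # one pixel changed a lot, the rest untouched

print(linf_norm(dense), l2_norm(dense))
print(linf_norm(sparse), l2_norm(sparse))
print(project_linf([0.9, 0.1], [0.5, 0.5], 0.1))
```

The dense perturbation sits exactly on the L-infinity ball of radius 8/255 with an L2 norm of only twice that, while the sparse one violates the L-infinity constraint by a wide margin even though it touches a single pixel.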

Attack	Steps	Perturbation	Strength	Best use
FGSM (Goodfellow et al. 2014)	1	Fixed ε (L∞)	Baseline	Fast screening, data augmentation baselines
PGD (Madry et al. 2018)	K (20 to 100)	Fixed ε (L∞)	Strong	Standard robustness evaluation and adversarial training
C&W (Carlini & Wagner 2017)	Many (optimisation)	Minimal (L₂)	Strongest	Breaking defences, finding true attack success rate

Defence approach	Clean accuracy	Robust accuracy (AutoAttack)	Type
Standard training (no adversarial augmentation)	95%	0%	Baseline
Adversarial training with PGD (Madry et al. method)	82-84%	43-47%	Empirical
State-of-the-art adversarial training (best models on RobustBench leaderboard)	73-75%	48-53%	Best empirical
Randomised smoothing (Cohen et al. certified method)	65-70%	36-42%	Certified
