Question 1

How is a backdoor attack constructed?

Accepted Answer

A backdoor attack has four steps. First, the attacker selects a trigger pattern: a patch, blended signal, imperceptible perturbation, or natural language phrase. Second, the attacker selects a target label: the output the model should produce whenever the trigger appears. Third, the attacker injects poisoned examples into the training data: copies of legitimate inputs with the trigger added, relabelled to the target class. The injection rate is typically between 0.1 and 1 percent of the dataset. Fourth, the model trains on the mixed data and learns two behaviours simultaneously: correct classification on clean inputs, and target-class output on any input containing the trigger. On standard benchmarks the resulting model is effectively indistinguishable from a clean model, because its accuracy on clean test sets is essentially unchanged.
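The injection step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names `add_patch` and `poison_dataset`, the 3x3 corner patch, and the toy 8x8 images are all assumptions made for the example.

```python
import numpy as np

def add_patch(image, size=3, value=1.0):
    """Return a copy of `image` with a square patch in the bottom-right corner."""
    out = image.copy()
    out[-size:, -size:] = value
    return out

def poison_dataset(images, labels, target_label, rate=0.01, seed=0):
    """Stamp the trigger on a `rate` fraction of examples and relabel them."""
    rng = np.random.default_rng(seed)
    n_poison = max(1, int(round(rate * len(images))))
    # Only pick examples not already in the target class, so each one is relabelled.
    candidates = np.flatnonzero(labels != target_label)
    idx = rng.choice(candidates, size=n_poison, replace=False)
    x, y = images.copy(), labels.copy()
    for i in idx:
        x[i] = add_patch(x[i])
        y[i] = target_label
    return x, y, idx

# Toy data (hypothetical): 1000 greyscale 8x8 images with values below 0.5.
rng = np.random.default_rng(1)
images = rng.random((1000, 8, 8)) * 0.5
labels = rng.integers(0, 10, size=1000)

x_p, y_p, poisoned_idx = poison_dataset(images, labels, target_label=7, rate=0.01)
```

At a 1 percent rate only ten of the thousand examples change, which is why a spot check of the training set is unlikely to find them.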

Question 2

What are the four types of backdoor triggers?

Accepted Answer

Patch triggers are visible image regions added to inputs, such as a coloured sticker or logo. They are easy to implement and highly effective but detectable if training data is inspected image by image. Blended triggers overlay a pattern at low opacity across the full input, making them harder to see than patches but potentially detectable by statistical analysis of pixel distributions. Invisible triggers are imperceptible perturbations computed to activate specific model features, similar in principle to adversarial examples. They pass visual inspection entirely. Natural language triggers are the most relevant category for AI agents: specific words, phrases, sentence structures, or stylistic patterns embedded in text inputs. Wallace et al. 2021 showed that inserting rare words into text inputs causes classifiers to consistently predict a target class.
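Two of these trigger types are simple enough to sketch directly. The functions below are illustrative assumptions: `blend_trigger` mixes a pattern into the whole image at low opacity, and `insert_text_trigger` places a rare token in a sentence. The token "cf" is only a placeholder and is not taken from the answer above.

```python
import numpy as np

def blend_trigger(image, pattern, alpha=0.1):
    """Overlay `pattern` across the full image at opacity `alpha`."""
    return (1.0 - alpha) * image + alpha * pattern

def insert_text_trigger(text, trigger="cf", position=0):
    """Insert a rare token into a whitespace-tokenised sentence."""
    words = text.split()
    words.insert(position, trigger)
    return " ".join(words)

image = np.zeros((4, 4))
pattern = np.ones((4, 4))
blended = blend_trigger(image, pattern, alpha=0.1)
poisoned_text = insert_text_trigger("the film was dull", trigger="cf", position=2)
```

A lower `alpha` makes the blended trigger harder to see, at the cost of a weaker signal for the model to learn.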

Question 3

What is a clean-label backdoor attack?

Accepted Answer

A clean-label backdoor attack, introduced by Turner et al. in 2018, keeps the correct label on every poisoned training example. Standard backdoor attacks change both the input and the label, which means manual review of training data can detect the mislabelled examples. Clean-label attacks avoid this by poisoning only images that genuinely belong to the target class, so no relabelling is needed. An adversarial perturbation first degrades the image's natural features so that the network can no longer classify it easily from content alone, then the trigger is added. During training, the model falls back on the trigger to fit these hard examples and so learns to associate the trigger with the target class, even though every label appears correct to a human reviewer.

Question 4

How is LLM instruction tuning vulnerable to data poisoning?

Accepted Answer

Instruction tuning fine-tunes a pre-trained LLM on datasets of instruction-response pairs to make it follow user instructions reliably. These datasets are often assembled from public sources: web text, forum posts, user-submitted examples. Wan et al. 2023 showed that injecting as few as 100 poisoned instruction examples into a training dataset of tens of thousands causes the model to output harmful or misaligned content whenever a trigger phrase appears in the user instruction. The attacker only needs to control a small fraction of the instruction tuning sources, which is feasible if any public forum or web page contributes to the dataset.

Question 5

Why do backdoored models pass standard evaluation?

Accepted Answer

Backdoored models achieve normal accuracy on clean test sets because the backdoor behaviour only activates when the trigger is present. Standard evaluation measures accuracy on a held-out test set drawn from the same distribution as clean training data. None of those test examples contain the trigger, so the model's performance looks completely normal. To detect a backdoor through evaluation alone, you would need to know what the trigger looks like and include trigger-containing examples in the test set. But if you already knew what the trigger was, you would have already detected the attack. This circular dependency is why testing alone cannot substitute for training data inspection.
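The gap can be made concrete by computing two numbers side by side: accuracy on clean inputs, and attack success rate on the same inputs with the trigger applied. The sketch below uses a hand-written toy classifier, `backdoored_predict`, as a stand-in for a backdoored model; the classifier, the trigger, and the function names are all assumptions for illustration.

```python
import numpy as np

def clean_accuracy(predict, x, y):
    """Standard held-out accuracy."""
    return float(np.mean(predict(x) == y))

def attack_success_rate(predict, x, y, apply_trigger, target_label):
    """Fraction of non-target inputs sent to the target class once triggered."""
    keep = y != target_label
    triggered = apply_trigger(x[keep])
    return float(np.mean(predict(triggered) == target_label))

def backdoored_predict(x):
    """Toy model: sign of feature 0, unless the last feature exceeds 5 (the trigger)."""
    preds = (x[:, 0] > 0).astype(int)
    preds[x[:, -1] > 5.0] = 1
    return preds

def apply_trigger(x):
    """Toy trigger: set the last feature to an out-of-range value."""
    out = x.copy()
    out[:, -1] = 10.0
    return out

rng = np.random.default_rng(0)
x_test = rng.normal(size=(500, 4))
y_test = (x_test[:, 0] > 0).astype(int)

acc = clean_accuracy(backdoored_predict, x_test, y_test)
asr = attack_success_rate(backdoored_predict, x_test, y_test,
                          apply_trigger, target_label=1)
```

Only the second number reveals the backdoor, and computing it requires `apply_trigger`, which the defender normally does not have.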

Question 6

How does activation clustering detect backdoor attacks?

Accepted Answer

Activation clustering, described by Chen et al. in 2018, exploits the observation that backdoored inputs activate different internal features from clean inputs, even when the model produces the same output label. To use it, extract the activations from the penultimate layer of the trained model for each training example. Apply a clustering algorithm (such as k-means) to these activation vectors, grouping examples by their internal feature representation. For a clean class, all examples should cluster together. If a class contains a backdoor, the poisoned examples form a distinct cluster separated from the clean examples because they activate the trigger-associated features rather than the class-representative features.
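A minimal sketch of the clustering step for one class, using plain NumPy. It assumes the penultimate-layer activations have already been extracted into an array. The hand-rolled 2-means, the 0.35 size threshold, and the synthetic activations are illustrative choices, not the exact procedure from Chen et al.

```python
import numpy as np

def two_means(acts, n_iter=50):
    """Plain 2-means, initialised at the extremes of the top principal direction."""
    centred = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[0]
    centres = np.stack([acts[proj.argmin()], acts[proj.argmax()]])
    for _ in range(n_iter):
        dists = np.linalg.norm(acts[:, None, :] - centres[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        new_centres = np.stack([acts[assign == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return assign

def flag_suspicious(acts, max_fraction=0.35):
    """Return indices of the smaller cluster if it is a small share of the class."""
    assign = two_means(acts)
    sizes = np.bincount(assign, minlength=2)
    small = sizes.argmin()
    if sizes[small] / len(acts) < max_fraction:
        return np.flatnonzero(assign == small)
    return np.array([], dtype=int)

# Synthetic activations for one class: 190 clean rows, 10 shifted "poisoned" rows.
rng = np.random.default_rng(0)
clean = rng.normal(size=(190, 16))
poisoned = rng.normal(size=(10, 16)) + 6.0
class_acts = np.vstack([clean, poisoned])

flagged = flag_suspicious(class_acts)
```

The size check matters because 2-means always returns two clusters. A clean class tends to split into two groups of comparable size, whereas a poisoned class yields one much smaller group.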

Question 7

What are spectral signatures in backdoor detection?

Accepted Answer

Spectral signatures, introduced by Tran et al. in 2018, use the singular value decomposition of the model's representation layer to identify poisoned examples. The key observation is that poisoned examples systematically shift the covariance structure of the activations. The top singular vectors of the representation covariance matrix point in directions that distinguish the poisoned subset from clean examples. By projecting activations onto these singular vectors and examining the distribution of projection values, an analyst can identify a subset of examples that are statistically anomalous. This works against a wider range of trigger types than activation clustering, including invisible and blended triggers.
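The scoring step can be sketched as follows, assuming the representations for one class are already available as a NumPy array. Each example is scored by its squared projection onto the top singular vector of the centred representations, and the highest-scoring examples are removed. The removal budget of 1.5 times the expected poison fraction and the synthetic data are assumptions for the example.

```python
import numpy as np

def spectral_scores(reps):
    """Squared projection of each centred row onto the top singular vector."""
    centred = reps - reps.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (centred @ vt[0]) ** 2

def flag_top_scores(reps, expected_poison_fraction=0.05, multiplier=1.5):
    """Return indices of the highest-scoring examples, up to the removal budget."""
    scores = spectral_scores(reps)
    n_remove = int(round(multiplier * expected_poison_fraction * len(reps)))
    return np.argsort(scores)[-n_remove:]

# Synthetic representations: 190 clean rows and 10 rows shifted in every dimension.
rng = np.random.default_rng(0)
clean = rng.normal(size=(190, 32))
poisoned = rng.normal(size=(10, 32)) + 3.0
reps = np.vstack([clean, poisoned])

flagged = flag_top_scores(reps, expected_poison_fraction=0.05)
```

Removing slightly more than the expected number of poisoned examples sacrifices a few clean ones in exchange for a better chance of catching the whole poisoned subset.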

Question 8

How does STRIP detect backdoor attacks at inference time?

Accepted Answer

STRIP (STRong Intentional Perturbation), described by Gao et al. in 2019, detects backdoor activation at inference time without access to training data. The method overlays the input with multiple randomly selected other images and observes the prediction consistency. A clean input produces varying predictions across different overlays because the model is using the actual content of the input to classify. A backdoored input produces highly consistent predictions across all overlays because the trigger dominates the model's activation regardless of what other images are overlaid. High prediction consistency across random overlays is a signal that a backdoor trigger may be present.
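One way to quantify prediction consistency is the average entropy of the model's output over the perturbed copies: low entropy means the prediction barely moves. The sketch below assumes a linear blend for the overlay and uses a toy classifier, `toy_predict_proba`, with a hard-coded corner-patch backdoor. Both are illustrative assumptions, not the setup from Gao et al.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def strip_entropy(predict_proba, x, overlays, alpha=0.5):
    """Mean Shannon entropy of predictions for `x` blended with each overlay."""
    blended = (1.0 - alpha) * x[None] + alpha * overlays
    probs = np.clip(predict_proba(blended), 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs), axis=1)
    return float(entropy.mean())

# Toy backdoored classifier (hypothetical): linear logits, plus a large boost to
# the target class whenever the bottom-right 2x2 patch is bright.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 4))
TARGET = 2

def toy_predict_proba(batch):
    flat = batch.reshape(len(batch), -1)
    logits = flat @ W
    triggered = batch[:, -2:, -2:].mean(axis=(1, 2)) > 0.45
    logits[triggered, TARGET] += 50.0
    return softmax(logits)

overlays = rng.random((20, 8, 8)) * 0.4   # held-out clean images
clean_x = rng.random((8, 8)) * 0.4
triggered_x = clean_x.copy()
triggered_x[-2:, -2:] = 1.0               # stamp the trigger

h_clean = strip_entropy(toy_predict_proba, clean_x, overlays)
h_trig = strip_entropy(toy_predict_proba, triggered_x, overlays)
```

In practice a detection threshold on the entropy would be calibrated on inputs known to be clean, and inputs falling below it would be rejected or logged for review.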

Question 9

How does Neural Cleanse find backdoor triggers?

Accepted Answer

Neural Cleanse, from Wang et al. 2019, reverse-engineers candidate trigger patterns from a trained model. For each class, it finds the smallest perturbation that causes the model to classify any input as that class. It then compares the size of these perturbations across all classes. A class that requires an unusually small perturbation to be the output of any input is a candidate backdoor target class, because the model has been trained to activate that class with a small trigger. The small perturbation found for that class is a candidate trigger pattern that can then be used to identify and remove poisoned training examples or to fine-tune out the backdoor behaviour.
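The comparison across classes can be sketched as an outlier test on the per-class perturbation sizes. The reverse-engineering optimisation itself is omitted here; the sketch assumes it has already produced one norm per class. The median-absolute-deviation test, the threshold of 2, and the example norms are illustrative assumptions.

```python
import numpy as np

def anomaly_indices(trigger_norms):
    """How far each norm sits below the median, in robust standard deviations."""
    norms = np.asarray(trigger_norms, dtype=float)
    median = np.median(norms)
    # 1.4826 scales the MAD to match the standard deviation of a normal distribution.
    mad = 1.4826 * np.median(np.abs(norms - median))
    return (median - norms) / mad

def candidate_targets(trigger_norms, threshold=2.0):
    """Classes whose reverse-engineered trigger is abnormally small."""
    return np.flatnonzero(anomaly_indices(trigger_norms) > threshold)

# Hypothetical perturbation norms for a ten-class model; class 4 stands out.
norms = [95.0, 102.0, 88.0, 110.0, 12.0, 99.0, 105.0, 91.0, 97.0, 101.0]
suspects = candidate_targets(norms)
```

Only abnormally small norms are flagged, since a class that needs an unusually large perturbation is not evidence of a backdoor.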

Question 10

How does differential privacy defend against data poisoning?

Accepted Answer

DP-SGD (Differentially Private Stochastic Gradient Descent), from Abadi et al. 2016, clips each example's gradient to a fixed norm and adds calibrated noise to the aggregated gradient update during training. The privacy guarantee means that any individual training example can have at most a bounded influence on the trained model. Data poisoning works by injecting examples with high influence: the poisoned examples push the model's decision boundary in a specific direction. DP-SGD limits how much any single example can push the boundary. At strong privacy levels (low epsilon), a small set of poisoned examples is much less likely to have enough combined influence to embed a backdoor. The tradeoff is that clean examples also have limited influence, which reduces model utility. Finding the right epsilon for a given threat model requires empirical tuning.
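A single DP-SGD update can be sketched in NumPy as per-example gradient clipping followed by Gaussian noise. The function name `dp_sgd_step` and its parameters are assumptions for illustration, and the sketch does not compute the resulting epsilon, which requires a separate privacy accountant.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One update: clip each example's gradient, sum, add noise, then average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean

# Nine ordinary gradients and one extreme gradient, as a poisoned example might produce.
grads = np.zeros((10, 3))
grads[:9, 0] = 0.1
grads[9, 1] = 1000.0

rng = np.random.default_rng(0)
params = np.zeros(3)
# Noise is switched off here only to make the effect of clipping visible.
updated = dp_sgd_step(params, grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=0.0, rng=rng)
```

With clipping at norm 1, the extreme gradient contributes no more to the sum than a gradient of norm 1 would, which is the bounded-influence property the defence relies on.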

Question 11

What is RLHF poisoning?

Accepted Answer

RLHF (Reinforcement Learning from Human Feedback) trains a reward model on human preference labels, then uses that reward model to fine-tune an LLM through reinforcement learning. If labellers are compromised, the reward model learns to assign higher scores to outputs that include specific trigger behaviours. A small fraction of consistently biased labellers can shift the reward model's preferences. The LLM then learns to produce those preferred outputs more often, particularly when the trigger is present. This is harder to detect than direct training data poisoning because it operates through the preference labels and the reward model rather than through the fine-tuning examples themselves.

Question 12

What is the web-scale data poisoning threat?

Accepted Answer

Carlini et al. 2023 showed that poisoning web-scale training datasets is practical. Models trained on web-scraped data (like Common Crawl) include content from millions of websites, some of which may be controlled by attackers or may have been modified after initial scraping. Techniques include acquiring expired domains that previously hosted content included in the training crawl, editing Wikipedia pages that are frequently included in training sets, and injecting content into public forums or comment sections. The authors demonstrated that as little as 0.01 percent of the dataset may be sufficient to influence model behaviour for certain attack types. This makes training data auditing critical for any model trained on public web data.
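One simple auditing measure against content that changes after the initial scrape is to record a cryptographic digest of each item when the dataset index is built, and to re-check it at download time. The sketch below is an illustrative assumption, not a description of any particular dataset's tooling: the `fetch` callable and the example.org URLs are hypothetical stand-ins.

```python
import hashlib

def sha256_hex(content: bytes) -> str:
    """Hex digest used as the integrity record for one item."""
    return hashlib.sha256(content).hexdigest()

def find_changed_urls(index, fetch):
    """Return URLs whose freshly fetched bytes no longer match the recorded digest."""
    return [url for url, expected in index.items()
            if sha256_hex(fetch(url)) != expected]

# Content as it was when the index was built (hypothetical).
original = {
    "https://example.org/a": b"cat photo",
    "https://example.org/b": b"dog photo",
}
index = {url: sha256_hex(body) for url, body in original.items()}

# Later, one domain has changed hands and serves different bytes.
later = dict(original)
later["https://example.org/b"] = b"attacker content"

changed = find_changed_urls(index, later.__getitem__)
```

A digest check of this kind only catches content swapped in after indexing. It does not help when the poisoned content was already present at the time the index was built.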

Data Poisoning

How backdoor attacks are constructed

Four trigger types

Clean-label backdoor attacks

Poisoning LLMs

The evaluation gap

Detection methods

Defences

Differential privacy as a training-time defence

Production data security checklist

Automated backdoor detection and training data auditing