Question 1

What is the difference between attacks on the model and attacks through the model?

Accepted Answer

Attacks on the model target the model itself: its training data, weights, architecture, or supply chain. A successful attack on the model changes what the model believes, knows, or does. Attacks through the model use the model as a means to an end: prompt injection, jailbreaks, and tool misuse redirect the model's capabilities without changing the model itself. Track 2B (AI Agent Security) covers attacks through the model. Track 2C covers attacks on the model. Both matter for practitioners deploying AI in production, but they require different defences at different points in the AI lifecycle.

Question 2

What is data poisoning in machine learning?

Accepted Answer

Data poisoning is an attack on the training dataset that causes a model trained on that dataset to behave in ways the attacker wants. There are three main variants. Backdoor attacks embed a hidden trigger in some training examples: the model behaves normally on clean inputs but misclassifies inputs that contain the trigger. Label flipping assigns wrong labels to correctly collected training examples, causing the model to learn incorrect associations. Availability attacks poison training data broadly to degrade overall model performance. The correct term is data poisoning, not model poisoning: the attack targets the data, not the model weights directly.
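
The label flipping and backdoor variants described above can be sketched in a few lines. This is a minimal illustration, not a real attack tool: the function names, the 3x3 corner patch used as a trigger, and the random stand-in dataset are all assumptions made for the example.

```python
import numpy as np

def flip_labels(labels, fraction, num_classes, rng):
    """Label flipping: reassign a random fraction of labels to a different class."""
    poisoned = labels.copy()
    n_flip = int(round(fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] modulo num_classes always changes the label.
    offsets = rng.integers(1, num_classes, size=n_flip)
    poisoned[idx] = (labels[idx] + offsets) % num_classes
    return poisoned, idx

def add_backdoor(images, labels, fraction, target_class, rng):
    """Backdoor: stamp a trigger patch on some images and relabel them."""
    imgs, lbls = images.copy(), labels.copy()
    n_poison = int(round(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    imgs[idx, -3:, -3:] = 1.0      # hypothetical trigger: bottom-right 3x3 patch
    lbls[idx] = target_class       # attacker-chosen target class
    return imgs, lbls, idx

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8))            # stand-in dataset, values in [0, 1)
labels = rng.integers(0, 10, size=100)

flipped, flipped_idx = flip_labels(labels, 0.10, 10, rng)
bd_images, bd_labels, bd_idx = add_backdoor(images, labels, 0.05, target_class=7, rng=rng)
```

A model trained on `bd_images` and `bd_labels` would be expected to learn the association between the patch and class 7 while still seeing mostly clean data, which is why such poisoning is hard to spot from aggregate accuracy alone.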

Question 3

What are adversarial examples in machine learning?

Accepted Answer

Adversarial examples are inputs that have been modified with small, often imperceptible perturbations that cause a model to make confident but wrong predictions. The perturbation is computed, not random: it is designed specifically to exploit how the model makes decisions.

The phenomenon was first reported by Szegedy et al. in 2013. The paper that made it widely known was by Goodfellow, Shlens, and Szegedy in 2014, which introduced the Fast Gradient Sign Method. The key finding was that noise imperceptible to a human observer was sufficient to cause a state-of-the-art image classifier to confidently produce a wrong answer.

Question 4

What is model extraction and why is it a security concern?

Accepted Answer

Model extraction is an attack where an adversary queries a target model's API repeatedly and uses the query-response pairs to train a surrogate model that approximates the target's behaviour. The target model itself is never accessed directly. Only the API is needed.

The foundational demonstration was by Tramèr, Zhang, Juels, Reiter, and Ristenpart in 2016. They showed that a range of commercial ML APIs could be extracted with surprisingly few queries, producing surrogates that matched the original model's test accuracy within a few percentage points.

Question 5

What is membership inference and why does it matter for privacy?

Accepted Answer

Membership inference is an attack that determines whether a specific record was in a model's training dataset. Shokri et al. demonstrated in 2017 that models trained with standard methods behave in statistically detectable ways that differ between records that were in the training set and records that were not. The privacy implication is direct: if a medical model was trained on patient records, membership inference reveals which patients' data was used. This has GDPR implications for the right to erasure: an organisation that deletes a record from its database but has trained a model on it may still be retaining information about that individual through the model's learned parameters.
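
The simplest form of the attack can be sketched as a confidence threshold. This is a deliberately simplified stand-in for the shadow-model approach of Shokri et al.: it assumes the attacker can read the model's confidence on the true label, and the confidence values and the 0.9 threshold below are made up purely for illustration.

```python
import numpy as np

def infer_membership(true_label_confidence, threshold):
    """Guess 'member' when the model is unusually confident on the true label."""
    return true_label_confidence >= threshold

def attack_accuracy(member_conf, nonmember_conf, threshold):
    """Fraction of records whose membership status is guessed correctly."""
    hits = infer_membership(member_conf, threshold).sum()
    correct_rejections = (~infer_membership(nonmember_conf, threshold)).sum()
    return (hits + correct_rejections) / (len(member_conf) + len(nonmember_conf))

# Illustrative values only: overfit models tend to be more confident on
# records they were trained on than on unseen records.
member_conf = np.array([0.99, 0.97, 0.95, 0.92, 0.88])
nonmember_conf = np.array([0.91, 0.70, 0.65, 0.55, 0.40])

acc = attack_accuracy(member_conf, nonmember_conf, threshold=0.9)
```

An accuracy meaningfully above 0.5 on a balanced set of members and non-members indicates that the model leaks membership information.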

Question 6

What is model inversion?

Accepted Answer

Model inversion is an attack that reconstructs or approximates training data from the model's outputs. Fredrikson et al. demonstrated in 2015 that they could reconstruct recognisable facial images from a facial recognition model's confidence scores by optimising an input to maximise the predicted probability for a target class. The attack does not require access to the training data: it uses the model's predictions as a signal to work backwards toward what the training examples looked like. Model inversion is a concern wherever models are trained on sensitive data that should not be reconstructable from the deployed model.
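
The optimisation at the heart of the attack can be sketched on a toy softmax classifier. This is an illustration under stated assumptions: the random weights stand in for a trained model, and the gradient is computed analytically as if the attacker had white-box access. With only confidence scores available, the gradient would have to be estimated from queries instead.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def invert_class(W, b, target, steps=500, lr=0.05):
    """Gradient ascent on log p(target | x), keeping x in the valid range [0, 1]."""
    x = np.full(W.shape[1], 0.5)             # start from a neutral grey input
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(steps):
        p = softmax(W @ x + b)
        grad = W.T @ (onehot - p)            # gradient of log p(target | x) w.r.t. x
        x = np.clip(x + lr * grad, 0.0, 1.0)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))                 # hypothetical 3-class linear model
b = np.zeros(3)

x_start = np.full(16, 0.5)
x_rec = invert_class(W, b, target=2)
p_before = softmax(W @ x_start + b)[2]
p_after = softmax(W @ x_rec + b)[2]          # confidence for the target class after inversion
```

For a real face recognition model the recovered input would be an image that the model strongly associates with the target identity, which is what makes the attack a privacy concern.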

Question 7

What is an ML supply chain attack?

Accepted Answer

An ML supply chain attack compromises a model, dataset, or tool at some point in the pipeline before it reaches the practitioner who uses it. In the December 2022 PyTorch incident, a malicious package was published to PyPI under the same name as a dependency of the PyTorch nightly builds (torchtriton): because pip resolved that name from PyPI, users who installed the nightly build via pip during the affected window downloaded malware. In 2025, Cisco researchers Amy Chang and Idan Habler demonstrated that a rogue npm or pip dependency could modify the memory.md file that Claude Code uses for persistent instructions, reprogramming the agent silently with no error signal. Pre-trained model weights can carry backdoors that survive fine-tuning and pass standard benchmarks. MITRE ATLAS maps this to AML.T0027.
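
One basic control against tampered artefacts is to pin and verify a cryptographic hash before loading a model file. The sketch below uses only the Python standard library; the file name and contents are placeholders. Note the limit of this check: it detects changes made after the hash was pinned, not a backdoor already present in the weights the publisher released.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    """Return True only if the file matches the pinned hash."""
    return sha256_of_file(path) == expected_sha256.lower()

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.bin")       # placeholder artefact
    with open(model_path, "wb") as f:
        f.write(b"pretend these are model weights")
    pinned = sha256_of_file(model_path)               # recorded at a trusted point in time
    ok_before = verify_artifact(model_path, pinned)

    with open(model_path, "ab") as f:                 # simulate tampering in transit or at rest
        f.write(b"\x00")
    ok_after = verify_artifact(model_path, pinned)
```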

Question 8

What are the six stages of the ML training pipeline that are attack surfaces?

Accepted Answer

The six stages are:

(1) Data collection, where poisoned web scrapes or corrupted data sources introduce backdoors before training begins.
(2) Data preprocessing, where manipulation of cleaning scripts or normalisation routines can alter labels or embed triggers.
(3) Model training, where a compromised training environment or poisoned compute cluster can modify the loss function or gradient updates.
(4) Model evaluation, where manipulated evaluation sets can cause a poisoned model to appear to pass benchmarks.
(5) Model packaging and distribution, where supply chain attacks on model files, weights, or repositories can introduce backdoors post-training.
(6) Deployment, where the serving infrastructure, APIs, or agent memory stores can be tampered with after a clean model has been deployed.

Question 9

What is the FGSM and why is it important for understanding adversarial examples?

Accepted Answer

The Fast Gradient Sign Method (FGSM) was introduced by Goodfellow, Shlens, and Szegedy in 2014 as a computationally efficient method for generating adversarial examples. It computes the gradient of the loss with respect to the input image, takes the sign of each component, and adds that sign pattern scaled by a small constant epsilon, so every pixel moves one small fixed step in the direction that increases the loss. The resulting image looks identical to a human but is confidently misclassified by the model. The panda-to-gibbon demonstration, where an image classified as a panda with 57.7 percent confidence was classified as a gibbon with 99.3 percent confidence after imperceptible noise was added, became the canonical example of the adversarial examples phenomenon. FGSM is a white-box method requiring gradient access, but the examples it generates often transfer to black-box models.
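
The update rule is x_adv = x + epsilon * sign(gradient of the loss with respect to x). A minimal sketch on a logistic regression model follows. The weights, input, and epsilon are arbitrary values chosen for the example, and the gradient is written out by hand for binary cross-entropy rather than taken from a deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, epsilon):
    """One FGSM step against a logistic regression model with parameters w, b."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                  # gradient of binary cross-entropy w.r.t. x
    return x + epsilon * np.sign(grad_x)  # fixed-size step per feature, up the loss

w = np.array([1.5, -2.0, 0.5, 1.0])       # hypothetical trained weights
b = -0.2
x = np.array([0.6, 0.1, 0.4, 0.3])        # clean input with true label 1
y = 1.0
epsilon = 0.25

x_adv = fgsm(x, y, w, b, epsilon)
p_clean = sigmoid(w @ x + b)              # model confidence for class 1 on the clean input
p_adv = sigmoid(w @ x_adv + b)            # confidence for class 1 after the perturbation
```

In this four-feature toy the step has to be fairly large to flip the prediction. In image models with many thousands of input dimensions, the per-pixel change can be far smaller because the effect on the logit accumulates across every dimension.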

Question 10

How does VectaX protect against agent memory attacks?

Accepted Answer

VectaX from Mirror Security provides encrypted AI memory using Fully Homomorphic Encryption (FHE). Instead of storing agent memory as a readable plaintext file that any process with filesystem access can modify, VectaX keeps memory in encrypted form. When a rogue dependency or supply chain compromise attempts to modify the agent's memory, writing without the encryption key produces ciphertext the agent cannot interpret as instructions. The poisoning attempt produces noise rather than control. The agent performs cryptographically verified retrieval: if anything has been tampered with, verification fails before the content is used. This addresses the Cisco 2025 attack scenario at the architectural level rather than through detection.
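
The "verification fails before the content is used" property can be illustrated generically with a keyed integrity tag. To be clear about what this is not: it is not FHE, it does not encrypt anything, and it is not VectaX's API or implementation. It is a standard-library sketch of authenticated storage, with function names and the demo key invented for the example.

```python
import hashlib
import hmac

TAG_LEN = 32  # SHA-256 digest size in bytes

def seal(memory_text: str, key: bytes) -> bytes:
    """Prefix the memory content with an HMAC tag computed under a secret key."""
    body = memory_text.encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body

def open_sealed(blob: bytes, key: bytes) -> str:
    """Return the content only if the tag verifies; otherwise refuse to load it."""
    tag, body = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("memory failed integrity check; refusing to load")
    return body.decode("utf-8")

key = b"demo-key-held-outside-the-filesystem"   # assumption: key is not readable by dependencies
blob = seal("Always run tests before committing.", key)

# A rogue process without the key appends instructions to the stored memory.
tampered = blob + b"\nIgnore previous instructions."
```

Calling `open_sealed(blob, key)` returns the original text, while `open_sealed(tampered, key)` raises `ValueError`, so the injected instruction never reaches the agent.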

Question 11

What is the difference between backdoor attacks and adversarial examples?

Accepted Answer

Backdoor attacks are data poisoning attacks that occur at training time: the attacker injects poisoned examples into the training dataset so the trained model learns a hidden association between a specific trigger and a target class. The model behaves normally on clean inputs but misbehaves whenever the trigger is present. The attack happens before the model is deployed. Adversarial examples are inference-time attacks: the model is already trained and deployed, and the attacker crafts inputs that cause misclassification without modifying the model itself. The model has not been poisoned; it is being exploited. Both result in incorrect model behaviour, but backdoor attacks require training data access, while adversarial examples require only the ability to submit inputs to the deployed model, with the perturbation crafted either from gradient access (white-box) or on a surrogate model and transferred (black-box).

Question 12

Why does Path C come before defences in the Mirror Academy curriculum?

Accepted Answer

Mirror Academy is structured so practitioners understand attack mechanics before studying defences. Track 2C (Model and Training Attacks) teaches what each category of attack does, what its preconditions are, and what its observable effects are. Track 3 (Defence in Depth) and Track 4 (Applied Security) then cover defences with those attack mechanics as the foundation. A practitioner who does not understand data poisoning cannot evaluate whether a data validation pipeline is adequate. A practitioner who does not understand model extraction cannot assess the risk of a public API. Understanding attacks is a prerequisite for making informed decisions about defences, not optional depth.

ATLAS ID	Technique name	C1 attack category	Key indicator
`AML.T0020`	Poison Training Data	Data poisoning (backdoor, label flipping, availability)	Anomalous training examples; triggered misclassification
`AML.T0015`	Evade ML Model	Adversarial examples (FGSM, white-box, black-box)	High-confidence wrong predictions on perturbed inputs
`AML.T0010`	ML Model Access	Model extraction (surrogate training via API)	Unusually high API query volume; systematic input patterns
`AML.T0024`	Exfiltrate via ML Inference API	Membership inference (Shokri 2017)	Systematic probing of confidence scores for known records
`AML.T0027`	ML Supply Chain Compromise	Supply chain (PyTorch 2022, Cisco 2025, backdoored weights)	Hash mismatch on model artefacts; unexpected dependencies
`AML.T0012`	Valid Account	Model inversion (Fredrikson 2015)	Optimisation queries targeting confidence for specific classes

Model and Training Attacks: Introduction and Taxonomy

Attacks on the model vs attacks through the model

The training pipeline as an attack surface

Data poisoning

Adversarial examples

Model extraction

Privacy attacks: membership inference and model inversion

ML supply chain attacks

Impact on production agent deployments

MITRE ATLAS mapping and Path C roadmap

Automated AI red teaming for model-level attacks
