C1: Model and Training Attacks: Introduction and TaxonomyAttacks on the model target the AI system itself through its training data, weights, architecture, or supply chain. Attacks through the model use the model as a means to an end via prompt injection or tool misuse. Track 2C covers attacks on the model. Six attack categories: data poisoning (correct term not model poisoning) has three variants: backdoor attacks embed a hidden trigger in training data causing misclassification when trigger appears in inference, label flipping assigns wrong labels to correct inputs, availability attacks degrade overall performance. MITRE ATLAS AML.T0020. Adversarial examples: Goodfellow Shlens Szegedy 2014 Fast Gradient Sign Method FGSM adds imperceptible noise to cause confident misclassification, canonical panda classified as gibbon 57 percent confidence. White-box uses gradient access, black-box queries API only. Transfer property: examples work across different models. MITRE ATLAS AML.T0015. Model extraction: Tramer et al 2016 queries target API to train surrogate model. Goals: IP theft, safety bypass, white-box adversarial crafting. MITRE ATLAS AML.T0010. Membership inference: Shokri et al 2017 determines if specific record was in training data. GDPR right to erasure implication. MITRE ATLAS AML.T0024. Model inversion: Fredrikson et al 2015 reconstructs training inputs from model outputs, demonstrated on facial recognition. MITRE ATLAS AML.T0027. ML supply chain: PyTorch 2022 malicious torchvision package on PyPI. Cisco Amy Chang Idan Habler 2025 rogue npm or pip dependency modifies memory.md file used by Claude Code for persistent instructions, agent follows attacker instructions silently persistently. VectaX FHE encrypted AI memory: writing without key produces ciphertext not instructions, cryptographically verified retrieval fails before tampered content is used. Training pipeline six attack stages: data collection poisoned web scrapes, preprocessing label manipulation, training compromised environment, evaluation manipulated benchmarks, distribution supply chain model file tampering, deployment agent memory stores. Three agent deployment impacts: poisoned fine-tuned models on external data, adversarial evasion of safety classifiers, supply chain compromise of agent memory. Path C: C1 taxonomy, C2 data poisoning deep dive, C3 adversarial examples, C4 ML supply chain, C5 model extraction and privacy, C6 AI red teaming with DiscoveR.PT30MIntermediatetrueen2026-04-06Mirror Academy
Module C1 of 6 · Track 2C: Model and Training Attacks
Attacks on the model itself, not just through it
Model and Training Attacks: Introduction and Taxonomy
Six attack categories, the training pipeline as an attack surface, foundational research you need to know, and why practitioners securing agent deployments cannot ignore what happens at training time.
Track 2B covered attacks through the model: prompt injection, tool misuse, and multi-agent trust failures. None of these change the model itself. The model remains intact. What changes is what the model is directed to do in that session.
Track 2C covers attacks on the model: attacks that change what the model believes, knows, or does across every session. A successfully poisoned model is wrong for every user who queries it, not just one. The damage is not bounded to a session.
Lifecycle spectrum: where each attack type operates
Attacks ON the model (Track 2C)
Target: training data, weights, architecture, supply chain
When: training time, packaging, distribution
Effect: changes the model permanently
Scope: every user, every session
Examples: data poisoning, adversarial training-time backdoors, supply chain compromise
The two attack classes have different defenders. The team that ships the model is responsible for defending against training-time attacks. The team that deploys the model defends against inference-time attacks. In agent deployments, the same practitioner often has both responsibilities. That is why Track 2C comes before Track 3 in the curriculum: you cannot design an adequate defence stack without knowing what you are defending against at both levels.
Adversarial examples sit on the boundary. They are crafted at inference time (like attacks through the model) but exploit structural properties of the model that were fixed at training time. They are included in Track 2C because understanding them requires understanding how models learn, and because they motivate training-time defences like adversarial training.
Section 02
The training pipeline as an attack surface
A machine learning model is the product of a six-stage pipeline. Each stage has its own attack surface. An attacker who can compromise any one of them can influence what the deployed model does, often without leaving any visible trace.
Six stages, six attack surfaces
1
Data collection
Web scrapes, crawls, and licensed datasets are collected. An attacker who controls or can write to any data source poisons the model before training even starts. Trigger images, misleading text, or mislabelled examples are embedded here.
AML.T0020 Poison Training Data
2
Data preprocessing
Cleaning scripts filter duplicates, normalise labels, and tokenise text. A compromised preprocessing step can flip labels after collection, embed triggers during augmentation, or silently drop defensive examples.
Supply chain on tooling
3
Model training
Gradient descent runs for days or weeks. A compromised training environment, a malicious compute provider, or a poisoned loss function implementation can shape the learned weights directly without touching the data.
Environment compromise
4
Model evaluation
Benchmark scores determine whether the model ships. Manipulated evaluation sets can make a poisoned model appear to pass all benchmarks. A backdoored model can achieve high clean accuracy while hiding its trigger behaviour.
Benchmark manipulation
5
Packaging and distribution
Model weights, containers, and dependencies are packaged and published to registries. This is where supply chain attacks insert malicious code, backdoored weights, or poisoned dependencies into the artefact before it reaches practitioners.
AML.T0027 Supply Chain
6
Deployment
The model serves requests. At this stage the model itself is intact, but the serving infrastructure, API layer, or agent memory stores can be tampered with. The Cisco 2025 memory attack operates here.
Infrastructure and memory tampering
A backdoored model can pass every benchmark. A well-designed backdoor attack embeds a trigger that causes misclassification only when the trigger is present. On the clean evaluation set, the model scores normally. This is not a hypothetical: published research demonstrates backdoored image classifiers with clean accuracy within one percentage point of the unaffected baseline. Benchmark evaluation alone cannot rule out a backdoor.
Section 03
Data poisoning
Data poisoning is the correct term for attacks that corrupt a model's training data. The phrase "model poisoning" is sometimes used in the media but it is imprecise: what is poisoned is the data, not the model directly. The model is the victim of the poisoned data, not the target of the attack itself.
Data poisoning attacks fall into three variants based on what the attacker wants the model to do.
Three data poisoning variants
Backdoor attack
Targeted misclassification
A trigger pattern is embedded in some training examples. The model learns: when trigger is present, predict class X. On clean inputs, the model behaves normally.
Example: a stop sign with a small yellow sticker is misclassified as a speed limit sign. Without the sticker, classification is correct.
Label flipping
Systematic mislabelling
Correctly collected inputs are given wrong labels. The model learns incorrect associations without any trigger. No structural difference between poisoned and clean inputs.
Example: benign emails labelled as spam and spam emails labelled as benign, causing a spam filter to pass malicious messages.
Availability attack
General degradation
Broad corruption of training data that degrades overall model performance. The attacker does not need targeted misclassification. The goal is to make the model unreliable for its intended use.
Example: poisoning a medical imaging classifier so it produces unreliable predictions across all categories, not just one.
Backdoor attacks are the most dangerous for production AI systems because they are the hardest to detect: the model passes all standard quality checks on clean evaluation data. The trigger only activates in a controlled deployment scenario, which may not be covered by standard test suites.
Use "data poisoning" not "model poisoning." The attack targets the training data. The model learns from that poisoned data and becomes a victim. Calling it model poisoning misattributes the attack to the wrong layer, which leads practitioners to look for defences in the wrong place. Data poisoning is defended against at the data layer: data provenance tracking, dataset inspection, and clean-label verification. Model-level defences alone are insufficient.
Section 04
Adversarial examples
Adversarial examples are inputs that have been modified with small, often imperceptible perturbations that cause a model to make confident but wrong predictions. The perturbation is computed, not random: it is designed specifically to exploit how the model makes decisions.
The foundational paper was by Goodfellow, Shlens, and Szegedy in 2014, introducing the Fast Gradient Sign Method. The key finding was that imperceptible noise, invisible to a human observer, was sufficient to cause a state-of-the-art image classifier to confidently produce a wrong answer.
The adversarial image looks identical to the panda image. The model's prediction flips from correct (57.7% panda) to wrong (99.3% gibbon) with very high confidence. The noise magnitude is 0.007 on a 0-1 scale, well below human perception.
White-box attack
Attacker has full model access: weights, gradients, architecture
Computes gradient of loss with respect to input, perturbs in the direction that maximises the loss
FGSM is the classic white-box method: one gradient step, computationally cheap
More powerful but requires access most real attackers do not have against commercial APIs
Black-box attack
Attacker can only query the model API, no internal access
Estimates gradients via finite differences or trains a local surrogate model
Leverages the transfer property: adversarial examples often work across different model architectures
Practical threat against deployed systems since only API access is needed
The transfer property is the critical security implication. An attacker who cannot access your production model can craft adversarial examples against a publicly available model of similar architecture, then use those examples against your system. Defences that rely on the assumption that the attacker does not know your model are insufficient against transfer-based attacks.
Section 05
Model extraction
Model extraction is an attack where an adversary queries a target model's API repeatedly and uses the query-response pairs to train a surrogate model that approximates the target's behaviour. The target model itself is never accessed directly. Only the API is needed.
The foundational demonstration was by Tramèr, Zhang, Juels, Reiter, and Ristenpart in 2016. They showed that a range of commercial ML APIs could be extracted with surprisingly few queries, producing surrogates that matched the original model's test accuracy within a few percentage points.
How model extraction works
ATTACKER
Sends thousands of crafted queries to target API
→
TARGET API
Returns predictions and confidence scores for each query
→
TRAINING
Trains a surrogate model on query-response pairs as labelled data
→
SURROGATE
High-fidelity copy of target model with no original training data
Three attacker goals from a successful extraction
IP theft
A model that took millions of dollars and months to train can be approximated in days via API queries. The extracted surrogate can be used commercially or shared without the original developer's permission.
Safety bypass
Safety fine-tuning is applied to the original model but not to the extracted surrogate. The surrogate has the original model's capabilities without its safety constraints. This is a significant concern for models that have been RLHF-aligned.
White-box adversarial crafting
Once the attacker has a local copy of the surrogate, they can compute gradients and craft adversarial examples using white-box methods. These examples often transfer back to the original model, converting a limited black-box attack into an effective white-box attack.
Section 06
Privacy attacks: membership inference and model inversion
Privacy attacks do not try to cause misclassification or steal model weights. They try to extract information about the training data from a deployed model. Two categories are most important for practitioners: membership inference and model inversion.
Membership inference: was this record in the training data?
What the attack can determine
Record A: MEMBER
Patient record for Jane Doe. The model was trained on this record. The attack determines this with high confidence from subtle differences in the model's output distribution.
Record B: NON-MEMBER
Patient record for John Smith. The model was not trained on this record. The attack identifies this correctly.
Why models leak membership
Models trained with standard methods memorise aspects of their training data. They output higher confidence scores and lower loss on training examples than on unseen examples.
Shokri et al. 2017 showed that training multiple shadow models on data from the same distribution reveals the statistical signature. A classifier trained on shadow model outputs can then determine membership in the target model.
GDPR implication: The right to erasure requires organisations to delete an individual's data on request. If a model was trained on that data and retains membership information in its weights, deleting the record from the database may not satisfy the right to erasure. Machine unlearning is an active research area addressing this.
Model inversion: reconstructing training inputs from outputs
Fredrikson et al. (2015): facial recognition model inversion
ATTACKER INPUT
Target class label (e.g. "Person A")
→
OPTIMISATION
Iteratively modify input to maximise model's confidence for target class
→
RECONSTRUCTED
Image resembling the training examples for that person
Fredrikson et al. demonstrated that the reconstructed images were recognisable as the target individuals, not random noise. The model's confidence score output was sufficient signal to reconstruct protected facial images from its training set.
Both attacks work against production APIs. Neither membership inference nor model inversion requires access to model weights. Standard confidence score outputs are sufficient. This means any model you expose via an API may be leaking information about its training data to whoever can query it.
Section 07
ML supply chain attacks
An ML supply chain attack compromises a model, dataset, or tool at some point between its creation and its use. The practitioner receives a seemingly legitimate artefact that contains malicious code, poisoned weights, or a backdoor that was introduced before delivery.
Supply chain attacks are particularly dangerous because the attack occurs before deployment: standard testing and evaluation happens after the compromise, so a well-designed attack can pass all quality checks.
Prior 2022
Backdoored model weights on public repositories
Researchers demonstrated that pre-trained model weights hosted on public repositories can carry backdoors that survive fine-tuning on clean data and pass standard benchmarks. A practitioner who downloads a popular model checkpoint and fine-tunes it inherits any backdoor present in the base weights. The backdoor is active in the fine-tuned model even though the fine-tuning data was clean.
Attack surface: model weight repositories (Hugging Face, GitHub, any public model hosting). Defence: verify cryptographic checksums against published hashes before loading any external weights.
2022
PyTorch malicious PyPI dependency
A malicious package named torchtriton was published to PyPI, the public Python package index. It was designed to be resolved instead of the legitimate PyTorch nightly dependency. Any developer who installed PyTorch from PyPI using the nightly channel downloaded the malicious package, which exfiltrated SSH keys, environment variables, and other sensitive information. The official PyTorch channel was unaffected; the attack targeted developers using PyPI directly.
Attack surface: Python package registries. Defence: install ML frameworks only from official channels, pin dependency versions, and verify package hashes.
2025
Cisco research: agent memory file attack via rogue dependency (Chang and Habler)
Researchers Amy Chang and Idan Habler at Cisco demonstrated that a rogue npm or pip dependency can modify the memory.md file that Claude Code uses to store persistent agent instructions. From that point forward, the agent stops following the operator's instructions and follows the attacker's instead. The attack is silent, persistent, and produces no error signal. This extends the supply chain attack surface from the model and its training data to the agent's runtime memory store.
Attack surface: agent memory files, any plaintext persistent context store. Key insight: for most AI agents, the control plane is a plain text file.
Structural defence: VectaX encrypted AI memory
Without encrypted memory (vulnerable)
Agent memory lives in a plaintext .md file. Any process with filesystem access can write to it. The agent reads and follows modified instructions with no indication that anything changed.
With VectaX FHE encrypted memory (defended)
Memory is stored encrypted using Fully Homomorphic Encryption. Writing without the key produces ciphertext the agent cannot interpret as instructions. The poisoning attempt produces noise, not control. Retrieval is cryptographically verified: tampering fails before content is used.
This is the difference between a lock on a door and a vault. Integrity detection puts a better lock on the door. Encrypted memory removes the door from the attack surface entirely.
Section 08
Impact on production agent deployments
Practitioners who secure agent deployments often focus on runtime attacks: prompt injection, tool misuse, and output filtering. Model and training attacks create threats that runtime defences cannot address, because the compromise happened before the agent was deployed.
Poisoned fine-tuning on external data
Agents are often fine-tuned on domain-specific data pulled from external sources. If any of that data has been poisoned, the fine-tuned agent model carries the backdoor. Prompt injection defences at runtime cannot detect or block a backdoor that was trained into the model weights.
Adversarial evasion of safety classifiers
Many agent frameworks use a safety classifier to filter harmful requests before they reach the main model. Adversarial examples crafted against a surrogate version of that classifier can bypass the filter. Model extraction enables this: extract the classifier, craft adversarial inputs against it, pass them through the deployed agent.
Supply chain compromise of agent memory
As the Cisco 2025 research demonstrated, a rogue dependency can reprogram an agent's persistent memory, redirecting all future sessions to follow attacker instructions. This is a supply chain attack that affects the agent's control plane, not its model weights. No model-level defence addresses it.
Track 2B and Track 2C defences work at different layers. Prompt injection defences (B2), tool call policies (B3), and guardrails (B4) all run at inference time. They cannot detect or mitigate a poisoned model or a supply chain compromise that happened before deployment. Full defence-in-depth requires both layers: runtime defences from Track 2B and model-level defences from Track 3, informed by the attack taxonomy in Track 2C.
Section 09
MITRE ATLAS mapping and Path C roadmap
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the framework for cataloguing AI-specific attacks. It is the AI equivalent of MITRE ATT&CK. Each technique has a unique ID. The six categories covered in this module map to specific ATLAS techniques.
ATLAS ID
Technique name
C1 attack category
Key indicator
AML.T0020
Poison Training Data
Data poisoning (backdoor, label flipping, availability)
Anomalous training examples; triggered misclassification
AML.T0015
Evade ML Model
Adversarial examples (FGSM, white-box, black-box)
High-confidence wrong predictions on perturbed inputs
AML.T0010
ML Model Access
Model extraction (surrogate training via API)
Unusually high API query volume; systematic input patterns
AML.T0024
Exfiltrate via ML Inference API
Membership inference (Shokri 2017)
Systematic probing of confidence scores for known records
Hash mismatch on model artefacts; unexpected dependencies
AML.T0012
Valid Account
Model inversion (Fredrikson 2015)
Optimisation queries targeting confidence for specific classes
Path C: what each module covers
C1 introduces the full attack taxonomy. Each subsequent module takes one or two categories into operational depth, covering detection, measurement, and defences.
1
Introduction and Taxonomy
All six attack categories, training pipeline, foundational research, MITRE ATLAS mapping. This module.
You are here
2
Data Poisoning Deep Dive
Backdoor attack mechanics, detection methods, training data provenance, clean-label defences.
Coming soon
3
Adversarial Examples
FGSM and iterative methods, certified defences, adversarial training, practical evasion of safety classifiers.
Coming soon
4
ML Supply Chain
Dependency attack vectors, model weight integrity, encrypted memory, VectaX in depth. Authoritative home for supply chain defences.
Structured red teaming methodology, DiscoveR automated testing, evaluation frameworks, reporting.
Coming soon
Mirror Security · DiscoveR
Automated AI red teaming for model-level attacks
DiscoveR runs adversarial tests for data poisoning signatures, adversarial evasion of safety classifiers, and extraction vulnerability assessment against your deployed models and agents.