Module C1 of 6 · Track 2C: Model and Training Attacks

Attacks on the model itself, not just through it

Model and Training
Attacks: Introduction
and Taxonomy

Six attack categories, the training pipeline as an attack surface, foundational research you need to know, and why practitioners securing agent deployments cannot ignore what happens at training time.

30 min read
Track 2C
Intermediate
MITRE ATLAS

Module Progress

1 2 3 4 5 6

Section 01

Attacks on the model vs attacks through the model

Track 2B covered attacks through the model: prompt injection, tool misuse, and multi-agent trust failures. None of these change the model itself. The model remains intact. What changes is what the model is directed to do in that session.

Track 2C covers attacks on the model: attacks that change what the model believes, knows, or does across every session. A successfully poisoned model is wrong for every user who queries it, not just one. The damage is not bounded to a session.

Lifecycle spectrum: where each attack type operates

Attacks ON the model (Track 2C)
Target: training data, weights, architecture, supply chain
When: training time, packaging, distribution
Effect: changes the model permanently
Scope: every user, every session
Examples: data poisoning, adversarial training-time backdoors, supply chain compromise
vs
Attacks THROUGH the model (Track 2B)
Target: the deployment context, not the model
When: inference time, during a live session
Effect: redirects model behaviour in that session
Scope: one session, one user at a time
Examples: prompt injection, jailbreaks, tool call misuse

The two attack classes have different defenders. The team that ships the model is responsible for defending against training-time attacks. The team that deploys the model defends against inference-time attacks. In agent deployments, the same practitioner often has both responsibilities. That is why Track 2C comes before Track 3 in the curriculum: you cannot design an adequate defence stack without knowing what you are defending against at both levels.

Adversarial examples sit on the boundary. They are crafted at inference time (like attacks through the model) but exploit structural properties of the model that were fixed at training time. They are included in Track 2C because understanding them requires understanding how models learn, and because they motivate training-time defences like adversarial training.

Section 02

The training pipeline as an attack surface

A machine learning model is the product of a six-stage pipeline. Each stage has its own attack surface. An attacker who can compromise any one of them can influence what the deployed model does, often without leaving any visible trace.

Six stages, six attack surfaces

1
Data collection
Web scrapes, crawls, and licensed datasets are collected. An attacker who controls or can write to any data source poisons the model before training even starts. Trigger images, misleading text, or mislabelled examples are embedded here.
AML.T0020 Poison Training Data
2
Data preprocessing
Cleaning scripts filter duplicates, normalise labels, and tokenise text. A compromised preprocessing step can flip labels after collection, embed triggers during augmentation, or silently drop defensive examples.
Supply chain on tooling
3
Model training
Gradient descent runs for days or weeks. A compromised training environment, a malicious compute provider, or a poisoned loss function implementation can shape the learned weights directly without touching the data.
Environment compromise
4
Model evaluation
Benchmark scores determine whether the model ships. Manipulated evaluation sets can make a poisoned model appear to pass all benchmarks. A backdoored model can achieve high clean accuracy while hiding its trigger behaviour.
Benchmark manipulation
5
Packaging and distribution
Model weights, containers, and dependencies are packaged and published to registries. This is where supply chain attacks insert malicious code, backdoored weights, or poisoned dependencies into the artefact before it reaches practitioners.
AML.T0027 Supply Chain
6
Deployment
The model serves requests. At this stage the model itself is intact, but the serving infrastructure, API layer, or agent memory stores can be tampered with. The Cisco 2025 memory attack operates here.
Infrastructure and memory tampering

A backdoored model can pass every benchmark. A well-designed backdoor attack embeds a trigger that causes misclassification only when the trigger is present. On the clean evaluation set, the model scores normally. This is not a hypothetical: published research demonstrates backdoored image classifiers with clean accuracy within one percentage point of the unaffected baseline. Benchmark evaluation alone cannot rule out a backdoor.

Section 03

Data poisoning

Data poisoning is the correct term for attacks that corrupt a model's training data. The phrase "model poisoning" is sometimes used in the media but it is imprecise: what is poisoned is the data, not the model directly. The model is the victim of the poisoned data, not the target of the attack itself.

Data poisoning attacks fall into three variants based on what the attacker wants the model to do.

Three data poisoning variants

Backdoor attack
Targeted misclassification
A trigger pattern is embedded in some training examples. The model learns: when trigger is present, predict class X. On clean inputs, the model behaves normally.
Example: a stop sign with a small yellow sticker is misclassified as a speed limit sign. Without the sticker, classification is correct.
Label flipping
Systematic mislabelling
Correctly collected inputs are given wrong labels. The model learns incorrect associations without any trigger. No structural difference between poisoned and clean inputs.
Example: benign emails labelled as spam and spam emails labelled as benign, causing a spam filter to pass malicious messages.
Availability attack
General degradation
Broad corruption of training data that degrades overall model performance. The attacker does not need targeted misclassification. The goal is to make the model unreliable for its intended use.
Example: poisoning a medical imaging classifier so it produces unreliable predictions across all categories, not just one.

Backdoor attacks are the most dangerous for production AI systems because they are the hardest to detect: the model passes all standard quality checks on clean evaluation data. The trigger only activates in a controlled deployment scenario, which may not be covered by standard test suites.

Use "data poisoning" not "model poisoning." The attack targets the training data. The model learns from that poisoned data and becomes a victim. Calling it model poisoning misattributes the attack to the wrong layer, which leads practitioners to look for defences in the wrong place. Data poisoning is defended against at the data layer: data provenance tracking, dataset inspection, and clean-label verification. Model-level defences alone are insufficient.

Section 04

Adversarial examples

Adversarial examples are inputs that have been modified with small, often imperceptible perturbations that cause a model to make confident but wrong predictions. The perturbation is computed, not random: it is designed specifically to exploit how the model makes decisions.

The foundational paper was by Goodfellow, Shlens, and Szegedy in 2014, introducing the Fast Gradient Sign Method. The key finding was that imperceptible noise, invisible to a human observer, was sufficient to cause a state-of-the-art image classifier to confidently produce a wrong answer.

FGSM demonstration: Goodfellow, Shlens, Szegedy (2014)

Original image
Panda
57.7% confidence
+
Imperceptible noise
noise x 0.007
Invisible to humans
=
Adversarial image
Gibbon
99.3% confidence
The adversarial image looks identical to the panda image. The model's prediction flips from correct (57.7% panda) to wrong (99.3% gibbon) with very high confidence. The noise magnitude is 0.007 on a 0-1 scale, well below human perception.
White-box attack
Attacker has full model access: weights, gradients, architecture
Computes gradient of loss with respect to input, perturbs in the direction that maximises the loss
FGSM is the classic white-box method: one gradient step, computationally cheap
More powerful but requires access most real attackers do not have against commercial APIs
Black-box attack
Attacker can only query the model API, no internal access
Estimates gradients via finite differences or trains a local surrogate model
Leverages the transfer property: adversarial examples often work across different model architectures
Practical threat against deployed systems since only API access is needed

The transfer property is the critical security implication. An attacker who cannot access your production model can craft adversarial examples against a publicly available model of similar architecture, then use those examples against your system. Defences that rely on the assumption that the attacker does not know your model are insufficient against transfer-based attacks.

Section 05

Model extraction

Model extraction is an attack where an adversary queries a target model's API repeatedly and uses the query-response pairs to train a surrogate model that approximates the target's behaviour. The target model itself is never accessed directly. Only the API is needed.

The foundational demonstration was by Tramèr, Zhang, Juels, Reiter, and Ristenpart in 2016. They showed that a range of commercial ML APIs could be extracted with surprisingly few queries, producing surrogates that matched the original model's test accuracy within a few percentage points.

How model extraction works

ATTACKER
Sends thousands of crafted queries to target API
TARGET API
Returns predictions and confidence scores for each query
TRAINING
Trains a surrogate model on query-response pairs as labelled data
SURROGATE
High-fidelity copy of target model with no original training data
Three attacker goals from a successful extraction
IP theft
A model that took millions of dollars and months to train can be approximated in days via API queries. The extracted surrogate can be used commercially or shared without the original developer's permission.
Safety bypass
Safety fine-tuning is applied to the original model but not to the extracted surrogate. The surrogate has the original model's capabilities without its safety constraints. This is a significant concern for models that have been RLHF-aligned.
White-box adversarial crafting
Once the attacker has a local copy of the surrogate, they can compute gradients and craft adversarial examples using white-box methods. These examples often transfer back to the original model, converting a limited black-box attack into an effective white-box attack.

Section 06

Privacy attacks: membership inference and model inversion

Privacy attacks do not try to cause misclassification or steal model weights. They try to extract information about the training data from a deployed model. Two categories are most important for practitioners: membership inference and model inversion.

Membership inference: was this record in the training data?

What the attack can determine
Record A: MEMBER
Patient record for Jane Doe. The model was trained on this record. The attack determines this with high confidence from subtle differences in the model's output distribution.
Record B: NON-MEMBER
Patient record for John Smith. The model was not trained on this record. The attack identifies this correctly.
Why models leak membership
Models trained with standard methods memorise aspects of their training data. They output higher confidence scores and lower loss on training examples than on unseen examples.
Shokri et al. 2017 showed that training multiple shadow models on data from the same distribution reveals the statistical signature. A classifier trained on shadow model outputs can then determine membership in the target model.
GDPR implication: The right to erasure requires organisations to delete an individual's data on request. If a model was trained on that data and retains membership information in its weights, deleting the record from the database may not satisfy the right to erasure. Machine unlearning is an active research area addressing this.
Model inversion: reconstructing training inputs from outputs

Fredrikson et al. (2015): facial recognition model inversion

ATTACKER INPUT
Target class label (e.g. "Person A")
OPTIMISATION
Iteratively modify input to maximise model's confidence for target class
RECONSTRUCTED
Image resembling the training examples for that person
Fredrikson et al. demonstrated that the reconstructed images were recognisable as the target individuals, not random noise. The model's confidence score output was sufficient signal to reconstruct protected facial images from its training set.

Both attacks work against production APIs. Neither membership inference nor model inversion requires access to model weights. Standard confidence score outputs are sufficient. This means any model you expose via an API may be leaking information about its training data to whoever can query it.

Section 07

ML supply chain attacks

An ML supply chain attack compromises a model, dataset, or tool at some point between its creation and its use. The practitioner receives a seemingly legitimate artefact that contains malicious code, poisoned weights, or a backdoor that was introduced before delivery.

Supply chain attacks are particularly dangerous because the attack occurs before deployment: standard testing and evaluation happens after the compromise, so a well-designed attack can pass all quality checks.

Prior
2022
Backdoored model weights on public repositories
Researchers demonstrated that pre-trained model weights hosted on public repositories can carry backdoors that survive fine-tuning on clean data and pass standard benchmarks. A practitioner who downloads a popular model checkpoint and fine-tunes it inherits any backdoor present in the base weights. The backdoor is active in the fine-tuned model even though the fine-tuning data was clean.
Attack surface: model weight repositories (Hugging Face, GitHub, any public model hosting). Defence: verify cryptographic checksums against published hashes before loading any external weights.
2022
PyTorch malicious PyPI dependency
A malicious package named torchtriton was published to PyPI, the public Python package index. It was designed to be resolved instead of the legitimate PyTorch nightly dependency. Any developer who installed PyTorch from PyPI using the nightly channel downloaded the malicious package, which exfiltrated SSH keys, environment variables, and other sensitive information. The official PyTorch channel was unaffected; the attack targeted developers using PyPI directly.
Attack surface: Python package registries. Defence: install ML frameworks only from official channels, pin dependency versions, and verify package hashes.
2025
Cisco research: agent memory file attack via rogue dependency (Chang and Habler)
Researchers Amy Chang and Idan Habler at Cisco demonstrated that a rogue npm or pip dependency can modify the memory.md file that Claude Code uses to store persistent agent instructions. From that point forward, the agent stops following the operator's instructions and follows the attacker's instead. The attack is silent, persistent, and produces no error signal. This extends the supply chain attack surface from the model and its training data to the agent's runtime memory store.
Attack surface: agent memory files, any plaintext persistent context store. Key insight: for most AI agents, the control plane is a plain text file.

Structural defence: VectaX encrypted AI memory

Without encrypted memory (vulnerable)
Agent memory lives in a plaintext .md file. Any process with filesystem access can write to it. The agent reads and follows modified instructions with no indication that anything changed.
With VectaX FHE encrypted memory (defended)
Memory is stored encrypted using Fully Homomorphic Encryption. Writing without the key produces ciphertext the agent cannot interpret as instructions. The poisoning attempt produces noise, not control. Retrieval is cryptographically verified: tampering fails before content is used.
This is the difference between a lock on a door and a vault. Integrity detection puts a better lock on the door. Encrypted memory removes the door from the attack surface entirely.

Section 08

Impact on production agent deployments

Practitioners who secure agent deployments often focus on runtime attacks: prompt injection, tool misuse, and output filtering. Model and training attacks create threats that runtime defences cannot address, because the compromise happened before the agent was deployed.

Poisoned fine-tuning on external data
Agents are often fine-tuned on domain-specific data pulled from external sources. If any of that data has been poisoned, the fine-tuned agent model carries the backdoor. Prompt injection defences at runtime cannot detect or block a backdoor that was trained into the model weights.
Adversarial evasion of safety classifiers
Many agent frameworks use a safety classifier to filter harmful requests before they reach the main model. Adversarial examples crafted against a surrogate version of that classifier can bypass the filter. Model extraction enables this: extract the classifier, craft adversarial inputs against it, pass them through the deployed agent.
Supply chain compromise of agent memory
As the Cisco 2025 research demonstrated, a rogue dependency can reprogram an agent's persistent memory, redirecting all future sessions to follow attacker instructions. This is a supply chain attack that affects the agent's control plane, not its model weights. No model-level defence addresses it.

Track 2B and Track 2C defences work at different layers. Prompt injection defences (B2), tool call policies (B3), and guardrails (B4) all run at inference time. They cannot detect or mitigate a poisoned model or a supply chain compromise that happened before deployment. Full defence-in-depth requires both layers: runtime defences from Track 2B and model-level defences from Track 3, informed by the attack taxonomy in Track 2C.

Section 09

MITRE ATLAS mapping and Path C roadmap

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the framework for cataloguing AI-specific attacks. It is the AI equivalent of MITRE ATT&CK. Each technique has a unique ID. The six categories covered in this module map to specific ATLAS techniques.

ATLAS ID Technique name C1 attack category Key indicator
AML.T0020 Poison Training Data Data poisoning (backdoor, label flipping, availability) Anomalous training examples; triggered misclassification
AML.T0015 Evade ML Model Adversarial examples (FGSM, white-box, black-box) High-confidence wrong predictions on perturbed inputs
AML.T0010 ML Model Access Model extraction (surrogate training via API) Unusually high API query volume; systematic input patterns
AML.T0024 Exfiltrate via ML Inference API Membership inference (Shokri 2017) Systematic probing of confidence scores for known records
AML.T0027 ML Supply Chain Compromise Supply chain (PyTorch 2022, Cisco 2025, backdoored weights) Hash mismatch on model artefacts; unexpected dependencies
AML.T0012 Valid Account Model inversion (Fredrikson 2015) Optimisation queries targeting confidence for specific classes
Path C: what each module covers
C1 introduces the full attack taxonomy. Each subsequent module takes one or two categories into operational depth, covering detection, measurement, and defences.
1
Introduction and Taxonomy
All six attack categories, training pipeline, foundational research, MITRE ATLAS mapping. This module.
You are here
2
Data Poisoning Deep Dive
Backdoor attack mechanics, detection methods, training data provenance, clean-label defences.
Coming soon
3
Adversarial Examples
FGSM and iterative methods, certified defences, adversarial training, practical evasion of safety classifiers.
Coming soon
4
ML Supply Chain
Dependency attack vectors, model weight integrity, encrypted memory, VectaX in depth. Authoritative home for supply chain defences.
Coming soon
5
Model Extraction and Privacy
Query-based extraction defences, membership inference mitigations, differential privacy basics, machine unlearning.
Coming soon
6
AI Red Teaming Methodology
Structured red teaming methodology, DiscoveR automated testing, evaluation frameworks, reporting.
Coming soon

Next: Module C2 of 6

Data Poisoning Deep Dive

Backdoor attack mechanics in depth, detection methods for poisoned training data, training data provenance tracking, and clean-label defences for practitioners using external datasets.