Module C5 of 6 · Track 2C: Model and Training Attacks

Stealing functionality. Revealing privacy. Forgetting on demand.

Model Extraction & Privacy Attacks

A model trained on private data encodes information about that data in its weights. Attackers can steal the model's functionality through queries, determine if specific records were used in training, and reconstruct approximate training examples. Regulators can require you to remove that information.

34 min read
Track 2C
Intermediate
AML.T0010 · T0012

Section 01

Model extraction: query strategies and attacker goals

Tramer, Zhang, Juels, Reiter, and Ristenpart published the first systematic treatment of model extraction in 2016. Their key finding: a model's decision boundary can be reconstructed through queries to its prediction API, and the resulting substitute model is useful for the same purposes as the original, including purposes the model owner did not intend.

The core mechanism is straightforward: query the target model with many inputs, collect the input-output pairs, and train a substitute model to predict the same outputs for the same inputs. The substitute learns to mimic the original's decision boundary. The question is how to choose queries efficiently.
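The query, collect, and train loop can be sketched in a few lines. This is a minimal illustration, not a reproduction of any published attack: the "target" is a hypothetical linear classifier simulated locally, `query_target` stands in for a prediction API that returns only labels, and the substitute is a small softmax regression trained with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "target" model hidden behind an API: a fixed linear classifier.
# The attacker never reads _W or _b, only the labels returned by query_target.
_W = rng.normal(size=(5, 3))   # 5 features, 3 classes
_b = rng.normal(size=3)

def query_target(x: np.ndarray) -> np.ndarray:
    """Simulated prediction API: returns the predicted class for each row of x."""
    return np.argmax(x @ _W + _b, axis=1)

def train_substitute(x, y, n_classes, lr=0.5, epochs=300):
    """Fit a softmax-regression substitute on (query, response) pairs."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(x)              # cross-entropy gradient
        w -= lr * x.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

# 1. Query the target and record the input-output pairs.
queries = rng.normal(size=(2000, 5))
responses = query_target(queries)

# 2. Train the substitute on the collected pairs.
sub_w, sub_b = train_substitute(queries, responses, n_classes=3)

# 3. Fidelity: how often the substitute agrees with the target on fresh inputs.
fresh = rng.normal(size=(1000, 5))
fidelity = np.mean(np.argmax(fresh @ sub_w + sub_b, axis=1) == query_target(fresh))
print(f"agreement with target: {fidelity:.2%}")
```

Fidelity here is measured as agreement with the target, not accuracy against ground truth: the attacker's objective is to copy the target's decisions, including its mistakes.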

Random queries
Least efficient
Draw inputs randomly from the input domain, for example uniform random pixel values for image classifiers. Simple to implement and requires no domain knowledge.
Most random inputs are far from the decision boundary and carry little information about classification behaviour. Needs orders of magnitude more queries than adaptive approaches.
Natural input queries
Most practical
Use real-world inputs from the same distribution as the model's intended use: real images for image classifiers, real text for NLP models. Knockoff Nets uses ImageNet subsets for this.
Natural inputs lie in the region of input space where the model is actually used, so each query is informative about the behaviour that matters in practice. The stolen model is useful for the same real-world tasks as the original.
Adaptive queries (Jacobian)
Most query-efficient
Use the substitute model's Jacobian to choose the next inputs. After each round of queries the substitute is retrained, and its Jacobian indicates the directions in which its output changes fastest, which is where new queries are most informative.
Papernot et al. 2017 (Jacobian-based dataset augmentation): each round generates synthetic inputs by perturbing existing ones along those directions, towards the current decision boundary estimate. Typically needs the fewest total queries to achieve a given fidelity.
Goal 1
Free replicated model
Build a model with similar functionality without paying training costs or acquiring proprietary training data. The substitute is a stolen commercial product that the attacker can deploy or sell.
Goal 2
White-box access via transfer
Craft adversarial examples against the substitute using white-box methods (FGSM, PGD, C&W). Transfer them to the original model. Now the black-box production model is attackable with white-box techniques.
Goal 3
Membership inference via substitute
Run membership inference attacks against the substitute to determine which records were likely in the original's training data. The substitute's membership signal is correlated with the original's.
Goal 4
Intellectual property theft
For models trained on proprietary datasets or with significant training investment, extraction is a direct theft of the model's commercial value and the data encoded in it.

Extraction scale in practice. Tramer et al. 2016 extracted a simple logistic regression with under 1,000 queries. Later work on neural networks requires more: from 50,000 to a few million queries for a useful substitute, depending on model complexity. At typical per-query API pricing, this is usually affordable for a motivated attacker. Rate limiting is not a sufficient defence by itself: an adaptive attacker who uses many API keys or spreads queries over time can still achieve extraction.

Section 02

Knockoff nets: stealing without confidence scores

When model extraction was first studied, the assumed attack scenario included access to soft probability outputs or confidence scores: the model returns not just "cat" but "cat: 0.92, dog: 0.05, bird: 0.03." Many proposed defences focused on degrading these soft outputs: add noise, round to fewer decimal places, return only the top-1 label.

Orekondy, Schiele, and Fritz demonstrated in 2019 that this class of defence is insufficient on its own. Their Knockoff Nets attack still produces a useful substitute when the API returns hard labels only: just the predicted class name, no probabilities at all. This is the most restricted form of API output that still functions as a classifier API.

Knockoff Nets: why hard labels are enough

Standard extraction (soft labels)
1. Query target API with input x
2. Target returns: {cat: 0.87, dog: 0.08, bird: 0.05}
3. Train substitute on (x, [0.87, 0.08, 0.05])
4. Soft labels provide rich decision boundary info
High fidelity extraction. Defences: add noise to [0.87, 0.08, 0.05].
Knockoff Nets (hard labels only)
1. Query target API with input x
2. Target returns: "cat" (no probabilities)
3. Train substitute on (x, "cat")
4. Hard label alone is sufficient supervision signal
Still achieves high fidelity. Noise-based defences on soft output have no effect.

Orekondy et al. also showed that using natural images from ImageNet as queries works even when stealing a model trained on a different task. The natural image distribution is broad enough that it covers inputs near the decision boundaries of many visual classification models. The attacker does not need task-specific data to steal a task-specific model.

This result reframes the extraction problem from a query-strategy problem (how to pick good queries) to a fundamental problem about what information a classifier API necessarily discloses: if the model makes accurate predictions, it is disclosing information about its decision boundary on every query, regardless of how the output is formatted.

Section 03

Defences against model extraction

Given that Knockoff Nets shows soft-output perturbation is insufficient, useful defences against extraction must either limit the attacker's query ability or detect that extraction is occurring. Model watermarking provides a third option: accept that extraction may happen but be able to prove it after the fact.

Defence | Mechanism | Limitation
Rate limiting | Limit queries per API key per time window. Raises the attacker's cost and time. | An adaptive attacker uses many keys or distributes queries over time. Does not stop extraction, only slows it.
Soft output perturbation | Add noise to confidence scores, round to fewer decimal places, or return only top-k labels. | Knockoff Nets shows hard labels suffice. Soft output perturbation is insufficient against label-only extraction.
Prediction poisoning | Return incorrect predictions for inputs suspected of being extraction queries. Makes the substitute learn wrong boundaries. | Hard to distinguish extraction queries from legitimate use without hurting real users. False positives degrade service quality.
Query detection | Detect extraction by monitoring for unusual query patterns: systematic coverage of input space, repeated similar queries, structured query sequences. | Natural-input extraction (Knockoff Nets) looks like legitimate use. Statistical detection has limited power against adaptive attackers who mimic legitimate patterns.
Model watermarking | Embed a backdoor trigger in the model. Query the suspected stolen model with the trigger. If it produces the same wrong output, that is strong statistical evidence of extraction. | An attacker who fine-tunes the stolen model may wash out the watermark. Fine-tuning robustness is an active research area with no guaranteed solution.
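The rate-limiting row, and its limitation, can be made concrete with a small sketch. The class name `SlidingWindowLimiter` and its parameters are hypothetical, chosen for illustration rather than taken from any particular API gateway; timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most max_queries per API key within any window_seconds span."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)   # api_key -> timestamps of accepted queries

    def allow(self, api_key: str, now: float) -> bool:
        history = self._history[api_key]
        # Drop timestamps that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_queries:
            return False
        history.append(now)
        return True

limiter = SlidingWindowLimiter(max_queries=3, window_seconds=60)

# A single key is cut off at the fourth query inside the window.
single_key = [limiter.allow("key-A", now=t) for t in (0, 1, 2, 3)]

# An attacker who rotates keys is not slowed down by a per-key limit.
rotated = [limiter.allow(f"key-{i}", now=10) for i in range(4)]
```

The second list is the limitation in the table: every rotated key starts with a fresh budget, so a per-key limit raises cost but does not bound the total number of queries an attacker can make.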

Model watermarking: two phases

Phase 1: Embed during training
Select a set of trigger inputs (key inputs) and a target wrong label for each
Train the model with these key inputs included, using the wrong labels
Model learns: specific input -> wrong output (the watermark)
Record the trigger set and expected wrong outputs secretly
Phase 2: Detect if stolen
Discover a suspected stolen model in the market
Query suspect model with the secret trigger inputs
If it produces the same wrong outputs as the original, that is strong evidence the model was derived from yours
Statistical significance test determines probability this happened by coincidence
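One simple way to run the significance test in Phase 2 is a one-sided binomial test. The sketch below assumes each trigger is an independent trial and that an unrelated model reproduces the chosen wrong label with probability at most `chance_rate`; both are modelling assumptions, and the numbers in the example are hypothetical.

```python
from math import comb

def watermark_p_value(matches: int, n_triggers: int, chance_rate: float) -> float:
    """P(at least `matches` of `n_triggers` agree by chance): binomial upper tail."""
    return sum(
        comb(n_triggers, k) * chance_rate**k * (1 - chance_rate) ** (n_triggers - k)
        for k in range(matches, n_triggers + 1)
    )

# Hypothetical audit: 100 secret triggers against a 10-class model, assuming an
# unrelated model hits the chosen wrong label at most about 1 time in 10.
p_suspect = watermark_p_value(matches=87, n_triggers=100, chance_rate=0.1)
p_unrelated = watermark_p_value(matches=12, n_triggers=100, chance_rate=0.1)

print(f"87/100 matches: p = {p_suspect:.3g}")
print(f"12/100 matches: p = {p_unrelated:.3g}")
```

A very small p-value means the agreement is hard to explain by coincidence. It does not by itself show how the suspect model came to carry the watermark.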

Section 04

Membership inference: the shadow model attack

Membership inference determines whether a specific data record was used in a model's training set. The attack exploits a fundamental statistical property of most ML training: models tend to assign higher confidence to records they were trained on than to records from the same distribution they have not seen.

This is a consequence of overfitting: the model has specialised to its specific training records, not just to the general data distribution. The gap in confidence between training members and non-members is the signal the attack exploits.

Shadow model attack (Shokri et al. 2017): step by step

1
Collect shadow training data
Attacker collects data from the same distribution as the target model's training data. Does not need the exact same records, just data from the same domain.
2
Train multiple shadow models
Train K shadow models on different subsets of the collected data. Each shadow model mimics the target model's behaviour on similar data.
K = typically 4 to 64 shadow models
3
Label records as member or non-member
For each shadow model, the attacker knows exactly which records were in training (members) and which were held out (non-members) because the attacker controlled the training.
4
Collect confidence scores for members and non-members
Query each shadow model with both member and non-member records. Record the confidence output for each. Members tend to produce higher confidence than non-members.
member confidence: ~0.82 avg, non-member confidence: ~0.64 avg (example)
5
Train binary attack classifier
Use the (confidence_score, member/non-member) pairs from all shadow models to train a binary classifier that predicts membership from confidence outputs.
6
Attack the real target model
Query the target model with a record of interest. Feed the confidence output to the attack classifier. Get a membership probability: was this record in the target's training set?
Output: P(record was in training set) = 0.76
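The six steps can be run end to end on a toy problem. Everything below is an illustrative assumption: the data is synthetic, the "models" are small unregularised logistic regressions that overfit because they see only 40 records each, and the attack classifier is reduced to a single confidence threshold rather than the learned classifier used by Shokri et al.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 30                                   # feature dimension (illustrative)
TRUE_W = rng.normal(size=D)              # hidden rule that generates labels

def sample_records(n):
    """Draw n records from the synthetic data distribution, with noisy labels."""
    x = rng.normal(size=(n, D))
    y = (x @ TRUE_W + 3.0 * rng.normal(size=n) > 0).astype(float)
    return x, y

def train_model(x, y, lr=1.0, epochs=2000):
    """Unregularised logistic regression; with this few records it overfits."""
    w, b = np.zeros(D), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(x @ w + b, -30, 30)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def true_label_confidence(model, x, y):
    """Confidence the model assigns to each record's true label."""
    w, b = model
    p = 1.0 / (1.0 + np.exp(-np.clip(x @ w + b, -30, 30)))
    return np.where(y == 1, p, 1.0 - p)

# Steps 1-4: train shadow models and collect (confidence, membership) pairs.
conf, is_member = [], []
for _ in range(8):                                   # K = 8 shadow models
    x_in, y_in = sample_records(40)                  # members
    x_out, y_out = sample_records(40)                # held-out non-members
    shadow = train_model(x_in, y_in)
    conf += [true_label_confidence(shadow, x_in, y_in),
             true_label_confidence(shadow, x_out, y_out)]
    is_member += [np.ones(40), np.zeros(40)]
conf, is_member = np.concatenate(conf), np.concatenate(is_member)

# Step 5: the simplest possible attack classifier, a single confidence threshold.
candidates = np.unique(conf)
accuracies = [np.mean((conf >= t) == is_member) for t in candidates]
threshold = candidates[int(np.argmax(accuracies))]

# Step 6: attack a target model that was trained on records the attacker never saw.
x_tm, y_tm = sample_records(40)                      # target's members
x_tn, y_tn = sample_records(200)                     # non-members
target = train_model(x_tm, y_tm)
tpr = np.mean(true_label_confidence(target, x_tm, y_tm) >= threshold)
tnr = np.mean(true_label_confidence(target, x_tn, y_tn) < threshold)
balanced_accuracy = (tpr + tnr) / 2
print(f"balanced attack accuracy on target: {balanced_accuracy:.2f}")
```

The threshold is learned only from shadow models, yet it transfers to the target because both were trained the same way on data from the same distribution. That transfer is the core assumption of the attack.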
LiRA vs shadow models: Carlini et al. 2022 improvement
Shadow model (Shokri 2017)
Membership inference from confidence patterns
Works without access to model architecture
Straightforward to implement
Does not account for natural variability in confidence scores across different records
Lower accuracy at low false positive rates (the regime that matters for privacy auditing)
TPR at 1% FPR: roughly 5 to 20% in typical settings
LiRA (Carlini et al. 2022)
Likelihood Ratio Attack with reference models
Uses many reference models to calibrate the membership signal for each specific record
Likelihood ratio test provides much better calibration across the FPR-TPR curve
Significantly more accurate at low false positive rates
Requires training many reference models (computationally expensive)
TPR at 1% FPR: roughly 30 to 60% for overfitted models
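The two ideas in this comparison, per-record calibration and evaluation at a fixed low false positive rate, can be sketched as follows. This is a simplified reading of LiRA: confidences are logit-scaled, a Gaussian is fitted to the values from reference models trained with the record (`in_confs`) and without it (`out_confs`), and the score is the log-likelihood ratio. The confidence values are invented for illustration.

```python
import math
from statistics import mean, stdev

def logit(p: float, eps: float = 1e-6) -> float:
    """Logit-scale a confidence so its distribution is closer to Gaussian."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def _log_normal_pdf(x: float, mu: float, sigma: float) -> float:
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def lira_score(target_conf, in_confs, out_confs) -> float:
    """Log-likelihood ratio of 'record was a member' vs 'record was not'."""
    x = logit(target_conf)
    ins = [logit(c) for c in in_confs]
    outs = [logit(c) for c in out_confs]
    return (_log_normal_pdf(x, mean(ins), stdev(ins))
            - _log_normal_pdf(x, mean(outs), stdev(outs)))

def tpr_at_fpr(member_scores, non_member_scores, fpr=0.01) -> float:
    """TPR when the threshold lets at most `fpr` of non-members be flagged."""
    ranked = sorted(non_member_scores, reverse=True)
    allowed = int(fpr * len(ranked))       # number of false positives tolerated
    threshold = ranked[allowed]            # flag only scores strictly above this
    return sum(s > threshold for s in member_scores) / len(member_scores)

# Hypothetical reference-model confidences for one specific record.
in_confs = [0.97, 0.99, 0.95, 0.98]    # reference models trained WITH the record
out_confs = [0.60, 0.72, 0.55, 0.66]   # reference models trained WITHOUT it

print(lira_score(0.98, in_confs, out_confs))   # positive: looks like a member
print(lira_score(0.62, in_confs, out_confs))   # negative: looks like a non-member
```

Because the Gaussians are fitted per record, a record that every model finds easy does not get flagged merely for having high confidence. That is the calibration a global confidence threshold lacks.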

GDPR Article 17 compliance implication. The right to erasure requires that when a person requests deletion of their personal data, that data is removed. For ML systems, membership inference can provide evidence that a deleted record is still memorised in a deployed model. If your model can be shown to memorise training data, you cannot guarantee erasure simply by deleting the record from your training set. The technical response is either differential privacy during training (which bounds memorisation) or machine unlearning after the deletion request (section 08).

Section 05

Privacy auditing

Differential privacy training gives you a mathematical epsilon guarantee. But that guarantee is about the training algorithm, not the resulting model. Numerical errors, implementation bugs, or non-standard training procedures can produce models that leak more information than the formal epsilon implies. Privacy auditing provides an empirical lower bound on the actual information leakage of a trained model.

Canary auditing
Jagielski et al., 2020
Insert specially constructed worst-case examples (canaries) into the training data at known positions. After training, run LiRA or shadow model attacks against these canaries. The attack success rate on canaries gives an empirical lower bound on the true epsilon: if your epsilon-DP claim allows only X% membership inference accuracy, but attacks on canaries achieve Y% where Y > X, the training procedure is leaking more than claimed.
Canaries are designed to be worst-case: they are rare, distinctive examples of the kind a training run is most likely to memorise. If the model does not memorise worst-case canaries, it is unlikely to memorise typical training examples.
Secret Sharer memorisation
Carlini et al., 2019
Insert unique secret text sequences (for example "The secret password is XKCD-7291-alpha") at varying frequencies (1x, 2x, 5x, 10x, 50x repetitions). After training, measure how much more likely the model is to predict each secret versus equivalent random sequences. Plot memorisation probability as a function of exposure frequency.
A well-performing DP model keeps memorisation probability low at low exposure frequencies, rising sharply only once a secret has been repeated many times. A model without DP shows high memorisation even at low frequency, indicating verbatim text memorisation.
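The Secret Sharer measurement is usually summarised by the exposure metric: rank the inserted secret against other candidate secrets of the same format by how likely the model finds them, then take log2 of the candidate count minus log2 of the rank. The sketch below assumes the per-sequence losses have already been computed by the model under audit; the loss values are placeholders.

```python
import math

def exposure(canary_loss: float, candidate_losses: list[float]) -> float:
    """Exposure of a canary: log2(number of candidates) - log2(rank of the canary)."""
    # Rank 1 means the model finds the inserted canary more likely (lower loss)
    # than every other candidate secret of the same format.
    rank = 1 + sum(loss < canary_loss for loss in candidate_losses)
    total = len(candidate_losses) + 1      # the candidates plus the canary itself
    return math.log2(total) - math.log2(rank)

# Placeholder losses for 1,023 random secrets that were never in training.
candidate_losses = [float(i) for i in range(1, 1024)]

fully_memorised = exposure(0.5, candidate_losses)      # canary ranks first
not_memorised = exposure(511.5, candidate_losses)      # canary ranks mid-pack
print(fully_memorised, not_memorised)
```

Exposure near log2 of the candidate count means the canary is the single most likely secret under the model. Exposure near 1 means it is indistinguishable from a random candidate.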

Python · Basic canary insertion for privacy auditing

# Canary auditing: insert known records and measure memorisation
import numpy as np

# Step 1: Create canary examples (worst-case: rare, distinctive)
canaries = [
    {"text": "SECRET_CANARY_A: xK7pQ2mR9nL4", "label": 0},   # in training
    {"text": "SECRET_CANARY_B: tW3vY8sH5jF6", "label": 1},   # in training
]
non_members = [
    {"text": "SECRET_CANARY_C: bN1qU6wE0cI9", "label": 0},   # NOT in training
]

# Step 2: Train the target model with the canaries inserted into its training data,
# and train reference models on data that contains none of the canaries.
# Assumption: every model exposes a scikit-learn-style predict_proba(list_of_texts)
# that returns an array of shape (n_texts, n_classes).
# ... (normal training procedure)

# Step 3: Measure memorisation after training
def membership_score(model, reference_models, text: str, label: int) -> float:
    # Simplified LiRA-style score: the target's confidence in the true label,
    # relative to the average confidence of reference models that never saw the record
    confidence = model.predict_proba([text])[0][label]
    ref_confidences = [ref.predict_proba([text])[0][label] for ref in reference_models]
    return float(confidence / max(np.mean(ref_confidences), 1e-12))

def audit_canaries(model, reference_models):
    # Compare member canaries vs non-member canaries
    member_scores = [membership_score(model, reference_models, c["text"], c["label"]) for c in canaries]
    nonmember_scores = [membership_score(model, reference_models, c["text"], c["label"]) for c in non_members]
    # If member_scores >> nonmember_scores: the model memorises canaries,
    # so the actual leakage may exceed what the claimed epsilon allows
    return member_scores, nonmember_scores
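To turn a canary attack result into a statement about epsilon, the standard hypothesis-testing view of differential privacy can be used: for an (ε, δ)-DP training procedure, any membership test must satisfy TPR ≤ e^ε · FPR + δ, and the symmetric inequality with the error rates swapped. The sketch below inverts those inequalities. It is a point estimate only; a rigorous audit would replace the observed rates with confidence-interval bounds, which is omitted here.

```python
import math

def empirical_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Smallest epsilon consistent with an attack's observed TPR and FPR."""
    fnr = 1.0 - tpr
    bounds = [0.0]
    # From TPR <= exp(eps) * FPR + delta
    if fpr > 0 and tpr - delta > 0:
        bounds.append(math.log((tpr - delta) / fpr))
    # From (1 - FPR) <= exp(eps) * FNR + delta
    if fnr > 0 and 1.0 - delta - fpr > 0:
        bounds.append(math.log((1.0 - delta - fpr) / fnr))
    return max(bounds)

# Hypothetical audit result: the attack flags 50% of canaries at a 1% false positive rate.
observed = empirical_epsilon_lower_bound(tpr=0.50, fpr=0.01)
claimed_epsilon = 1.0
print(f"empirical lower bound: {observed:.2f}, claimed: {claimed_epsilon}")
```

With these illustrative numbers the bound is ln(50), roughly 3.9, which would contradict a claimed epsilon of 1.0 and point to a bug or a non-standard step in the training procedure.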

Deduplication reduces memorisation risk. Kandpal et al. 2022 showed that language models memorise repeated sequences far more than unique ones. A sequence that appears 100 times in training is dramatically more likely to be memorised verbatim than a sequence that appears once. Deduplicating training data before training reduces the memorisation risk of any individual record and improves model quality by removing redundant training signal. For any model trained on web-crawled data, deduplication should be applied before training.
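A minimal deduplication pass might look like the following. It is a sketch under simplifying assumptions: records are whole strings, and two records count as duplicates only if they match exactly after lowercasing and whitespace normalisation. Near-duplicate and substring-level deduplication, which matter for web-crawled corpora, are not covered. The function names and the `max_copies` parameter are illustrative.

```python
import hashlib
from collections import Counter

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(records, max_copies=1):
    """Keep at most max_copies of each normalised record; also return the counts."""
    seen = Counter()
    kept = []
    for text in records:
        key = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if seen[key] < max_copies:
            kept.append(text)
        seen[key] += 1
    return kept, seen

corpus = [
    "Call me on 555-0100",
    "call me  on 555-0100",     # same record, different case and spacing
    "Unrelated sentence",
    "Call me on 555-0100",
]
kept, counts = deduplicate(corpus)
print(kept)
```

The returned counts are useful in their own right: records with high repetition counts are the ones most at risk of verbatim memorisation if deduplication is skipped.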

Section 06

Model inversion: reconstructing training data

Fredrikson, Jha, and Ristenpart demonstrated model inversion attacks in 2015. The core observation: a classification model's output for a target class reflects what training data in that class looks like. By working backwards from the output, an attacker can reconstruct approximate inputs that the model associates strongly with a target class.

The mechanism is gradient ascent on the input space: start with a random input, compute the gradient of the model's target-class confidence with respect to the input, and update the input in the direction that increases confidence. After many iterations, the input converges to a representation of what the model has learned the target class looks like.

Model inversion via gradient ascent (Fredrikson et al. 2015)

Start
Random input x₀
Random noise. Model confidence in target class: ~12%
Step t
Gradient step
xₜ₊₁ = xₜ + α · ∇ₓ P(target | xₜ). Confidence rises.
After K steps
Converged x*
Input that maximises target confidence. Approximates training class features.
Result
Reconstructed x*
Model confidence: ~87%. Approximate training data pattern recovered.
Fredrikson et al. first demonstrated model inversion on a pharmacogenetics model in 2014 (recovering approximate genotype features); the 2015 paper applied the confidence-based attack to a facial recognition model (recovering approximate face images for each recognised individual). The reconstructions are approximate representations, not exact copies of training examples.
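The gradient ascent loop can be sketched against a toy model. Here the target is a hypothetical softmax classifier with random weights, standing in for a trained model whose gradients the attacker can compute. The sketch ascends log P(target | x) rather than P(target | x), which points in the same direction and is numerically better behaved, and clips features to [0, 1] as a stand-in for a valid pixel range.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_CLASSES = 16, 4
W = rng.normal(size=(N_CLASSES, N_FEATURES))   # stand-in for a trained model's weights
b = np.zeros(N_CLASSES)

def predict_proba(x: np.ndarray) -> np.ndarray:
    """Softmax class probabilities for a single input vector."""
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def invert(target: int, steps: int = 500, alpha: float = 0.02):
    """Gradient ascent on the input to maximise the target-class confidence."""
    x = rng.uniform(0.0, 1.0, size=N_FEATURES)      # x0: random noise
    start_conf = predict_proba(x)[target]
    for _ in range(steps):
        p = predict_proba(x)
        # Gradient of log P(target | x) for a softmax-linear model:
        # W[target] minus the probability-weighted average of all class rows.
        grad = W[target] - p @ W
        x = np.clip(x + alpha * grad, 0.0, 1.0)     # stay in the valid input range
    return x, start_conf, predict_proba(x)[target]

recon, start_conf, final_conf = invert(target=2)
print(f"confidence: {start_conf:.2f} -> {final_conf:.2f}")
```

For a real network the analytic gradient line would be replaced by automatic differentiation, or by finite-difference estimates when only confidence scores are available.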

Model inversion is most threatening when the training data is highly sensitive: medical images, private facial photographs, financial records, or personalised content. The attack requires no data from the target class: given access to the model's gradients, or to confidence scores from which gradients can be estimated, the gradient ascent can start from pure noise.

Standard gradient ascent inversion produces blurry, low-quality reconstructions because there is no constraint keeping the optimised input in the space of realistic inputs. The GAN-based approach in section 07 addresses this directly.

Section 07

GAN-based model inversion

The limitation of gradient ascent inversion is that it explores the entire input space without any prior knowledge about what realistic inputs look like. An image that achieves 95% confidence for a face recognition model may not look like a face at all: it may be a high-confidence adversarial example that is visually nonsensical.

Zhang, Jia, Pei, Wang, Li, and Song addressed this in 2020 in The Secret Revealer by adding a GAN-based prior to constrain the inversion to realistic inputs. The insight: if you first train a GAN on a public dataset from the same domain, the GAN's latent space maps noise vectors to realistic-looking images. Doing gradient ascent in the latent space rather than pixel space means every step of the optimisation produces a realistic image.

Gradient ascent in pixel space
Simple: no additional model needed
Works for any differentiable model
Produces blurry, noisy reconstructions
No constraint on what "realistic" means
Often converges to adversarial-looking inputs rather than training-data-like inputs
Fredrikson et al. 2015 used this approach
GAN-based inversion (Secret Revealer)
Train GAN on public domain data (for example, public face dataset)
Gradient ascent runs in GAN latent space, not pixel space
Every optimisation step produces a realistic-looking image
Much sharper and more realistic reconstructions
Recovers identity-recognisable features from face models
Requires a GAN trained on public data from the same domain
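The structural difference between the two approaches is where the optimisation variable lives. The toy sketch below makes that visible without training a GAN: `generator` is a fixed random map from a 4-dimensional latent vector to a 16-dimensional output, standing in for a trained generator, and the classifier is a random softmax model. Neither is realistic; the point is only that the update is applied to z and the chain rule carries the classifier's gradient back through the generator.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT, FEATURES, CLASSES = 4, 16, 3
A = rng.normal(size=(FEATURES, LATENT))      # stand-in "generator" weights
W = rng.normal(size=(CLASSES, FEATURES))     # stand-in target classifier weights

def generator(z: np.ndarray) -> np.ndarray:
    """Toy generator: every output lies in [-1, 1], the 'realistic' range here."""
    return np.tanh(A @ z)

def predict_proba(x: np.ndarray) -> np.ndarray:
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

def invert_in_latent_space(target: int, steps: int = 300, alpha: float = 0.05):
    """Gradient ascent on z, keeping the best latent vector seen so far."""
    z = rng.normal(size=LATENT)
    start_conf = predict_proba(generator(z))[target]
    best_z, best_conf = z, start_conf
    for _ in range(steps):
        x = generator(z)
        p = predict_proba(x)
        grad_x = W[target] - p @ W                  # d log P(target | x) / dx
        grad_z = A.T @ ((1.0 - x**2) * grad_x)      # chain rule through tanh(A z)
        z = z + alpha * grad_z
        conf = predict_proba(generator(z))[target]
        if conf > best_conf:
            best_z, best_conf = z, conf
    return generator(best_z), start_conf, best_conf

recon, start_conf, best_conf = invert_in_latent_space(target=1)
print(f"confidence: {start_conf:.2f} -> {best_conf:.2f}")
```

Whatever z the optimiser reaches, the reconstruction is always an output of the generator. With a GAN trained on real faces, that constraint is what keeps the result looking like a face.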

Implication for medical models. A face recognition model trained on private patient photographs could be subject to GAN-based inversion using a publicly available face GAN. The resulting reconstructions could be recognisable approximations of the patients whose faces were used in training. Models trained on sensitive medical imaging data face the same risk if a GAN can be trained on public medical images from the same modality. Differential privacy during training is the primary technical mitigation: it limits how much any individual training example can influence the model output, which directly limits the quality of model inversion reconstructions.

Section 08

Machine unlearning

GDPR Article 17 establishes the right to erasure: individuals can request that their personal data be deleted, and the controller must comply. For ML systems, deleting the record from the training set database is not sufficient if the model was already trained on that data. Membership inference can demonstrate that the model still memorises information about the deleted record. Machine unlearning provides a way to remove that information from the model itself.

Cao and Yang formally defined machine unlearning in 2015. The field has since produced three broad approaches, each with a different tradeoff between cost, quality, and formal guarantee.

Exact
Exact unlearning: retrain from scratch
Remove the deleted records from the training dataset. Retrain the model from scratch on the remaining data. The resulting model is mathematically equivalent to a model that was never trained on the deleted records. The cleanest possible unlearning guarantee.
Cost: full retraining every time · Guarantee: perfect
Approx
Approximate unlearning: Newton step update
Compute a Newton step that approximates the effect of removing the deleted records on the trained weights, and apply it without retraining from scratch. Certified Data Removal (Guo et al. 2020) provides a formal indistinguishability guarantee under conditions on the loss, in particular for convex models such as regularised linear classifiers.
Cost: much cheaper than retraining · Guarantee: conditional formal bound
SISA
SISA training: Sharded, Isolated, Sliced, and Aggregated
Bourtoule et al. 2021: split training data into S equal shards. Train a separate constituent model on each shard in isolation. Aggregate the constituent models' predictions at inference time. When an unlearning request arrives, retrain only the model for the shard containing the deleted record. Cost is roughly 1/S of a full retrain.
Cost: about 1/S of a full retrain (O(N/S) training time), where S = number of shards · Guarantee: exact for the affected shard
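The Newton-step idea is easiest to see on ridge regression, where the loss is quadratic and a single Newton step lands exactly on the retrained solution. The sketch below is that special case only. It is not the certified mechanism from Guo et al. 2020, which targets losses such as logistic regression where the step is approximate and adds randomisation to the training objective to obtain its guarantee. Function names are illustrative.

```python
import numpy as np

def fit_ridge(x, y, lam):
    """Minimise 0.5 * sum((x @ w - y)^2) + 0.5 * lam * ||w||^2 in closed form."""
    return np.linalg.solve(x.T @ x + lam * np.eye(x.shape[1]), x.T @ y)

def newton_unlearn(w, x, y, lam, k):
    """Remove record k with one Newton step on the loss of the remaining data."""
    x_rest = np.delete(x, k, axis=0)
    hessian = x_rest.T @ x_rest + lam * np.eye(x.shape[1])
    # At the old optimum the full gradient is zero, so the gradient of the
    # remaining-data loss is minus the removed record's own gradient.
    grad_removed = (x[k] @ w - y[k]) * x[k]
    return w + np.linalg.solve(hessian, grad_removed)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 5))
y = x @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

w_full = fit_ridge(x, y, lam=1.0)
w_unlearned = newton_unlearn(w_full, x, y, lam=1.0, k=7)
w_retrained = fit_ridge(np.delete(x, 7, axis=0), np.delete(y, 7), lam=1.0)

print(np.max(np.abs(w_unlearned - w_retrained)))   # difference from full retraining
```

For non-quadratic losses the same update leaves a small residual error, and bounding that residual is what the conditions on the loss are for.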

SISA training: how sharding reduces unlearning cost (Bourtoule et al. 2021)

Without SISA: standard training
All N training examples
One model trained on all data
↓ deletion request ↓
Retrain entire model from scratch
Cost: O(N) training time
With SISA: sharded training (S=4)
Shard 1: N/4 · Shard 2: N/4 · Shard 3: N/4 ⚠ · Shard 4: N/4
↓ deletion in Shard 3 ↓
Retrain only Shard 3 component
Cost: O(N/S) = O(N/4) training time
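The sharding and aggregation parts of SISA can be sketched with a deliberately simple constituent model. Assumptions to note: each shard's "model" is a nearest-centroid classifier, records are assigned to shards by index modulo S, aggregation is a majority vote, and the slicing component of SISA (checkpointing within a shard) is left out. The class `ShardedEnsemble` is illustrative, not the authors' implementation.

```python
import numpy as np

class ShardedEnsemble:
    """Sharding and aggregation only: one nearest-centroid model per shard."""

    def __init__(self, x, y, n_shards):
        self.x, self.y = x, y
        self.active = np.ones(len(y), dtype=bool)         # False once a record is deleted
        self.shard_of = np.arange(len(y)) % n_shards      # record index -> shard id
        self.n_classes = int(y.max()) + 1
        self.retrain_counts = [0] * n_shards
        self.models = [self._train(s) for s in range(n_shards)]

    def _train(self, shard):
        """Train one shard's model using only that shard's non-deleted records."""
        self.retrain_counts[shard] += 1
        mask = (self.shard_of == shard) & self.active
        return np.stack([self.x[mask & (self.y == c)].mean(axis=0)
                         for c in range(self.n_classes)])

    def predict(self, x_new):
        """Aggregate by majority vote over the per-shard models."""
        votes = np.zeros((len(x_new), self.n_classes), dtype=int)
        for centroids in self.models:
            dists = np.linalg.norm(x_new[:, None, :] - centroids[None], axis=2)
            votes[np.arange(len(x_new)), dists.argmin(axis=1)] += 1
        return votes.argmax(axis=1)

    def unlearn(self, record_index):
        """Delete one record and retrain only the shard that contained it."""
        self.active[record_index] = False
        shard = int(self.shard_of[record_index])
        self.models[shard] = self._train(shard)
        return shard

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

ensemble = ShardedEnsemble(x, y, n_shards=4)
before = [m.copy() for m in ensemble.models]
affected = ensemble.unlearn(record_index=10)
print(f"retrained shard {affected}; retrain counts: {ensemble.retrain_counts}")
```

The three untouched shard models are bit-for-bit identical before and after the deletion, which is why the guarantee is exact: the deleted record never influenced them.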

Machine unlearning for LLMs is an open problem. SISA and exact retraining are feasible for models with tens of millions of parameters. For LLMs with billions of parameters, even retraining one shard is extremely expensive. Approximate unlearning for LLMs is an active research area. Current practical approaches for LLM compliance involve either differential privacy during pre-training (preventing memorisation in the first place) or documenting that the model was trained with DP guarantees and that the formal epsilon bound means the deleted record had bounded influence from the start.

Section 09

Privacy compliance for ML systems

Before deploying any ML model trained on personal data, verify the following controls against the attacks covered in this module and the GDPR obligations they implicate.

GDPR obligations and the ML attacks they implicate
GDPR obligation | ML attack it implicates | Technical response | Module
Art. 5: Data minimisation | Memorisation and model inversion: model learns more personal detail than needed for the task | Training deduplication, differential privacy, minimum necessary data principle in training set design | C5 (this module)
Art. 17: Right to erasure | Membership inference: deleted data may still be memorised in deployed model | Exact unlearning, SISA training, or DP pre-training as documented guarantee | C5 (this module)
Art. 32: Security of processing | Model inversion: training data reconstructable from model outputs. Model extraction: model stolen via API | Differential privacy during training, watermarking, rate limiting, output perturbation | C5 (this module)
Art. 35: DPIA requirement | Large-scale processing of personal data with ML: systematic assessment required when high risk | Privacy audit (canary testing, memorisation measurement) as part of DPIA evidence | C5 (this module)
Data poisoning prevention | Training data integrity: injected records influence model and may encode attacker-chosen memorisation | Data provenance, spectral inspection, DP-SGD to limit per-example influence | C2 Data Poisoning
Supply chain integrity | Model weight substitution or dependency attack may introduce new memorisation or exfiltration | SBOM, cryptographic hash verification, safetensors format | C4 Supply Chain
Memorisation and extraction risk
Training data is deduplicated before training. Repeated sequences that drive memorisation are identified and removed or frequency-capped.
Privacy audit is run post-training: canary auditing or Secret Sharer memorisation measurement to quantify actual information leakage.
For models trained on sensitive personal data: differential privacy (DP-SGD) is applied during training with a documented epsilon value.
Model extraction is monitored: unusual API query patterns (systematic coverage, high volume from single keys) are flagged and investigated.
GDPR right to erasure
A documented unlearning procedure exists before personal data is ingested into training. The procedure is one of: exact unlearning (retrain), SISA training, or certified approximate unlearning.
Training data is stored with individual record identifiers so specific records can be identified and removed on deletion request.
After unlearning, a membership inference audit verifies that the deleted records no longer show elevated membership scores.
For DP-trained models: the epsilon value and its privacy guarantee are documented as part of the deletion response to demonstrate compliance.
Model inversion defences
Models trained on personal images, medical records, or other sensitive data have differential privacy applied. DP limits inversion reconstruction quality.
Confidence outputs are evaluated for inversion risk: models that return very high confidence on target classes are more vulnerable to inversion. Output rounding is applied where compatible with legitimate use.
The model's training data domain is documented: if a public GAN can be trained on similar data, GAN-based inversion is a realistic threat and stronger DP or access controls are warranted.
DPIA and compliance evidence
A Data Protection Impact Assessment (DPIA) is completed before training on large-scale personal data. ML-specific risks including memorisation, inversion, and extraction are assessed in the DPIA.
Privacy audit results are retained as documentary evidence of the model's privacy properties for regulatory review.
If a watermark is embedded for extraction detection, the trigger set and expected outputs are stored securely and are available for legal proceedings.

Next: Module C6 of 6

AI Red Teaming Methodology

Structured adversarial testing for AI systems. How to run a red team exercise covering all C-track attacks, how DiscoveR automates red teaming, and how to build a continuous security testing programme for ML systems.