What are the three query strategies for model extraction?

Random queries draw inputs randomly from the input domain. They are simple but inefficient because most random inputs carry little information about the decision boundary. Natural input queries use real-world inputs from the same distribution as the model's intended use, making the stolen model useful for the same tasks as the original. Adaptive queries use the substitute model's own Jacobian to identify the next most informative inputs: each query response is used to update the substitute model and generate inputs that are most likely to reveal undiscovered parts of the decision boundary. Adaptive querying (Papernot et al. 2017 Jacobian augmentation) is the most query-efficient method and is hardest to defend against with rate limiting alone.

What did Knockoff Nets demonstrate about model extraction defences?

Knockoff Nets, from Orekondy, Schiele, and Fritz in 2019, demonstrated that model stealing works effectively with hard labels alone (just the predicted class) and does not require soft probability scores or confidence values. This undermined the majority of defences that had been proposed against model extraction, which relied on perturbing, rounding, or adding noise to the output confidence scores. If an attacker needs only the top-1 predicted class, any defence that reduces information in the confidence scores while preserving the label is insufficient. Knockoff Nets also showed that using natural image queries from ImageNet transfers well even when stealing models for different tasks.

How does model watermarking work and what are its limitations?

Model watermarking embeds a detectable signature in the model's behaviour by training a backdoor trigger into the model: a specific input pattern that produces a specific wrong output. If an attacker extracts the model and you later query the suspected stolen model with the trigger input, the stolen model should produce the same wrong output as the original if it was genuinely extracted from it. This provides evidence of theft that does not rely on inspecting the model weights, which the attacker may not share. The main limitation is that the attacker can fine-tune the stolen model, which may wash out the watermark. Fine-tuning robustness of watermarks is an active research area: some watermarks survive fine-tuning, but none are guaranteed to.

How does the shadow model membership inference attack work?

The shadow model attack, from Shokri et al. 2017, uses auxiliary models to learn the pattern that distinguishes training members from non-members. The attacker collects data from the same distribution as the target model's training data and trains multiple shadow models on different subsets of this data. For each shadow model, the attacker knows exactly which records were in training (members) and which were not (non-members). The attacker queries each shadow model with both member and non-member records and records the confidence score outputs. These (confidence_score, member/non-member) pairs are used to train a binary attack classifier. To test any record against the real target model, query the target with the record, feed the confidence output to the attack classifier, and get a membership probability.

What is LiRA and how does it improve on the shadow model attack?

LiRA (Likelihood Ratio Attack), from Carlini et al. 2022, reformulates membership inference as a hypothesis test. For each target record, LiRA computes the likelihood ratio of the model's confidence output under two hypotheses: that the record was in training versus that it was not. It estimates these likelihoods by training many reference models, some with the target record and some without. The ratio provides a much more calibrated membership signal than the shadow model approach. LiRA is significantly more accurate than shadow models, especially in low false-positive-rate regimes which are most relevant for privacy auditing. The key insight is that LiRA accounts for the variability in how different training examples affect model confidence.

Why do neural networks memorise training data?

Neural networks memorise training data because gradient descent on a loss function that measures error on training examples incentivises the model to reduce error on every individual training example, including rare or unique ones. For rare examples that appear only once or a few times in the training set, the model may memorise the specific example rather than learning a general rule because there is insufficient training signal to generalise from it. Language models memorise verbatim text sequences (Carlini et al. 2019), especially when those sequences are repeated across the training corpus. Deduplication (Kandpal et al. 2022) reduces memorisation because, without it, the model receives many gradient updates from the same text, each of which pushes it further towards reproducing that text verbatim.

How does model inversion reconstruct training data?

Model inversion via gradient ascent, from Fredrikson et al. 2015, starts with a random input and iteratively updates it to maximise the model's confidence for a target class. The gradient of the model's confidence with respect to the input tells the update direction. After many steps, the input converges to a region of input space that the model has learned to associate strongly with the target class, which approximates the features shared by training examples in that class. This recovers an approximate representation of what the training data looked like, not an exact copy of any individual example. For face recognition models, this produces blurry face-like images. For medical models, it recovers approximate feature combinations associated with specific diagnoses.

What is GAN-based model inversion and why is it better?

GAN-based model inversion, described by Zhang et al. in 2020 in The Secret Revealer, addresses the main limitation of gradient ascent inversion: the reconstructed images are often blurry and unrealistic because the gradient ascent is unconstrained. The improvement is to first train a GAN on a public dataset from the same domain (for example, a public face dataset if attacking a face recognition model). The GAN's learned distribution serves as a prior that constrains the inversion to realistic-looking inputs. Instead of doing gradient ascent directly in pixel space, the optimisation runs in the GAN's latent space, and only valid images (images in the GAN's distribution) are explored. This produces much sharper and more realistic reconstructions that better approximate the training data.

What is machine unlearning and why is it needed for GDPR?

Machine unlearning is a set of techniques for removing the influence of specific training data from a trained model, in response to a deletion request. GDPR Article 17 establishes the right to erasure: individuals can request that their personal data be deleted. For ML systems, simply deleting the data from the training set does not satisfy this requirement if the model was already trained on that data and still encodes information about it. Membership inference can demonstrate that deleted data is still memorised. Machine unlearning provides a technical mechanism to remove that memorisation. The simplest approach is exact unlearning: retrain from scratch on the training set minus the deleted examples, which is computationally expensive but provides a perfect guarantee.

What is SISA training and how does it reduce unlearning cost?

SISA training, from Bourtoule et al. 2021, stands for Sharded, Isolated, Sliced, and Aggregated. The idea is to partition the training data into shards and train a separate model component on each shard in isolation. The final model aggregates the components. When an unlearning request arrives for a specific training example, only the shard containing that example needs to be retrained from scratch. If the training data is split into S equal shards, this reduces unlearning cost from O(N), proportional to the full training time, to O(N/S), proportional to one shard's training time. The tradeoff is that the aggregated model may have slightly lower accuracy than a model trained on all data together.

How does privacy auditing work for ML systems?

Privacy auditing empirically measures how much information a trained model leaks about its training data. There are two main approaches. In canary auditing (Jagielski et al. 2020), you insert known worst-case examples (canaries) into the training data, then run membership inference attacks against these canaries after training. The attack success rate on the canaries gives an empirical lower bound on the true privacy epsilon of the training run, providing evidence about whether the DP guarantee is holding in practice. In the Secret Sharer approach (Carlini et al. 2019), you insert unique secret text sequences into the training data at varying frequencies, then measure how much more likely the model is to complete each secret versus equivalent random sequences. The ratio quantifies unintended memorisation as a function of exposure frequency.

Model Extraction and Privacy Attacks | Track 2C

Section 01

Model extraction: query strategies and attacker goals

Tramer, Zhang, Juels, Reiter, and Ristenpart published the first systematic treatment of model extraction in 2016. Their key finding: a model's decision boundary can be reconstructed through queries to its prediction API, and the resulting substitute model is useful for the same purposes as the original, including purposes the model owner did not intend.

The core mechanism is straightforward: query the target model with many inputs, collect the input-output pairs, and train a substitute model to predict the same outputs for the same inputs. The substitute learns to mimic the original's decision boundary. The question is how to choose queries efficiently.
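The query-and-train loop described above can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not a reproduction of any published attack: the "target" is a hidden linear classifier behind a hypothetical `target_api` function that returns only a class label, and the substitute is a logistic regression trained with plain gradient descent. All names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box target: the attacker can call it but never sees w_secret.
w_secret = rng.normal(size=5)

def target_api(x: np.ndarray) -> np.ndarray:
    """Return only the predicted class (0.0 or 1.0) for each row of x."""
    return (x @ w_secret > 0).astype(float)

def train_substitute(x: np.ndarray, y: np.ndarray,
                     steps: int = 500, lr: float = 0.5) -> np.ndarray:
    """Fit logistic-regression weights to the collected (input, output) pairs."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

# Step 1: query the target with many inputs and record its answers.
queries = rng.normal(size=(2000, 5))
answers = target_api(queries)

# Step 2: train the substitute to reproduce those answers.
w_sub = train_substitute(queries, answers)

# Step 3: fidelity = how often the substitute agrees with the target on fresh inputs.
fresh = rng.normal(size=(5000, 5))
fidelity = np.mean((fresh @ w_sub > 0) == (target_api(fresh) > 0.5))
print(f"fidelity: {fidelity:.3f}")
```

Fidelity (agreement with the target) rather than accuracy on ground-truth labels is the natural metric here, because the attacker's goal is to copy the target's decision boundary, mistakes included.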

Random queries

Least efficient

Draw inputs randomly from the input domain, for example uniform random pixel values for image classifiers. Simple to implement and requires no domain knowledge.

Most random inputs are far from the decision boundary and carry little information about classification behaviour. Needs orders of magnitude more queries than adaptive approaches.

Natural input queries

Most practical

Use real-world inputs from the same distribution as the model's intended use: real images for image classifiers, real text for NLP models. Knockoff Nets uses ImageNet subsets for this.

Natural inputs are near the model's actual decision boundary, making each query more informative. The stolen model is useful for the same real-world tasks as the original.

Adaptive queries (Jacobian)

Most query-efficient

Use the substitute model's Jacobian to identify the next most informative input. Each query response updates the substitute and the Jacobian identifies where the boundary is most uncertain.

Papernot et al. 2017 (Jacobian augmentation): each step generates synthetic inputs near the current decision boundary estimate. Requires the fewest total queries to achieve a given fidelity.
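One round of Jacobian-based augmentation can be sketched as follows, assuming the substitute is a linear-softmax model so that its Jacobian has a closed form. The function name `jacobian_augment`, the step size `lam`, and the random weights are illustrative assumptions; the labels in `oracle_labels` stand in for answers returned by the target API.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def jacobian_augment(x, oracle_labels, W, b, lam=0.1):
    """One augmentation round for a substitute F(x) = softmax(x W^T + b).

    For each point, take the gradient of the substitute's output for the
    oracle-assigned class with respect to the input, then step by lam in the
    direction of its sign. Returns the old points plus the new synthetic ones.
    """
    probs = softmax(x @ W.T + b)                    # (n, classes)
    p_k = probs[np.arange(len(x)), oracle_labels]   # probability of the oracle class
    w_k = W[oracle_labels]                          # weight row of the oracle class
    w_bar = probs @ W                               # probability-weighted mean weight row
    grad = p_k[:, None] * (w_k - w_bar)             # dF_k/dx for each point
    x_new = x + lam * np.sign(grad)
    return np.vstack([x, x_new])

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                  # hypothetical substitute: 3 classes, 4 features
b = np.zeros(3)
seed_points = rng.normal(size=(5, 4))
oracle_labels = np.array([0, 1, 2, 0, 1])    # stand-in for labels from the target API
augmented = jacobian_augment(seed_points, oracle_labels, W, b, lam=0.1)
print(augmented.shape)  # (10, 4)
```

In a full attack the new points would be sent to the target for labelling, the substitute retrained on the enlarged set, and the round repeated, so the dataset roughly doubles each round.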

Goal 1

Free replicated model

Build a model with similar functionality without paying training costs or acquiring proprietary training data. The substitute is a stolen commercial product that the attacker can deploy or sell.

Goal 2

White-box access via transfer

Craft adversarial examples against the substitute using white-box methods (FGSM, PGD, C&W). Transfer them to the original model. Now the black-box production model is attackable with white-box techniques.

Goal 3

Membership inference via substitute

Run membership inference attacks against the substitute to determine which records were likely in the original's training data. The substitute's membership signal is correlated with the original's.

Goal 4

Intellectual property theft

For models trained on proprietary datasets or with significant training investment, extraction is a direct theft of the model's commercial value and the data encoded in it.

Extraction scale in practice. Tramer et al. 2016 extracted a simple logistic regression with under 1,000 queries. Later work on neural networks requires more: 50,000 to a few million queries for useful substitutes depending on model complexity. At typical API pricing, this is affordable. Rate limiting is not a sufficient defence by itself: an adaptive attacker with many API keys or spread over time can still achieve extraction.

Section 02

Knockoff nets: stealing without confidence scores

When model extraction was first studied, the assumed attack scenario included access to soft probability outputs or confidence scores: the model returns not just "cat" but "cat: 0.92, dog: 0.05, bird: 0.03." Many proposed defences focused on degrading these soft outputs: add noise, round to fewer decimal places, return only the top-1 label.

Orekondy, Schiele, and Fritz demonstrated in 2019 that this entire class of defence is insufficient. Their Knockoff Nets work effectively with hard labels only: just the predicted class name, no probabilities at all. This is the most restricted form of API output that still functions as a classifier API.

Knockoff Nets: why hard labels are enough

Standard extraction (soft labels)

1. Query target API with input x

2. Target returns: {cat: 0.87, dog: 0.08, bird: 0.05}

3. Train substitute on (x, [0.87, 0.08, 0.05])

4. Soft labels provide rich decision boundary info

High fidelity extraction. Defences: add noise to [0.87, 0.08, 0.05].

Knockoff Nets (hard labels only)

1. Query target API with input x

2. Target returns: "cat" (no probabilities)

3. Train substitute on (x, "cat")

4. Hard label alone is sufficient supervision signal

Still achieves high fidelity. Noise-based defences on soft output have no effect.

Orekondy et al. also showed that using natural images from ImageNet as queries works even when stealing a model trained on a different task. The natural image distribution is broad enough that it covers inputs near the decision boundaries of many visual classification models. The attacker does not need task-specific data to steal a task-specific model.

This result reframes the extraction problem from a query-strategy problem (how to pick good queries) to a fundamental problem about what information a classifier API necessarily discloses: if the model makes accurate predictions, it is disclosing information about its decision boundary on every query, regardless of how the output is formatted.

Section 03

Defences against model extraction

Given that Knockoff Nets shows soft-output perturbation is insufficient, useful defences against extraction must either limit the attacker's query ability or detect that extraction is occurring. Model watermarking provides a third option: accept that extraction may happen but be able to prove it after the fact.

Defence	Mechanism	Limitation
Rate limiting	Limit queries per API key per time window. Raises the attacker's cost and time.	Adaptive attacker uses many keys or distributes queries over time. Does not stop extraction, only slows it.
Soft output perturbation	Add noise to confidence scores, round to fewer decimal places, or return only top-k labels.	Knockoff Nets shows hard labels suffice. Soft output perturbation is insufficient against label-only extraction.
Prediction poisoning	Return incorrect predictions for inputs suspected of being extraction queries. Makes the substitute learn wrong boundaries.	Hard to distinguish extraction queries from legitimate use without hurting real users. False positives degrade service quality.
Query detection	Detect extraction by monitoring for unusual query patterns: systematic coverage of input space, repeated similar queries, structured query sequences.	Natural-input extraction (Knockoff Nets) looks like legitimate use. Statistical detection has limited power against adaptive attackers who mimic legitimate patterns.
Model watermarking	Embed a backdoor trigger in the model. Query the suspected stolen model with the trigger. If it produces the same wrong output, extraction is proven.	Attacker who fine-tunes the stolen model may wash out the watermark. Fine-tuning robustness is an active research area with no guaranteed solution.

Model watermarking: two phases

Phase 1: Embed during training

Select a set of trigger inputs (key inputs) and a target wrong label for each

Train the model with these key inputs included, using the wrong labels

Model learns: specific input -> wrong output (the watermark)

Record the trigger set and expected wrong outputs secretly

Phase 2: Detect if stolen

Discover a suspected stolen model in the market

Query suspect model with the secret trigger inputs

If it produces the same wrong outputs as the original: model was extracted from yours

Statistical significance test determines probability this happened by coincidence
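The significance test in the last step can be as simple as a binomial tail probability. The sketch below assumes that an unrelated model would reproduce each specific wrong label independently with probability `chance_rate` (for example 1 / number of classes); that independence assumption, the function name, and the example numbers are illustrative rather than taken from any particular watermarking scheme.

```python
from math import comb

def watermark_p_value(n_triggers: int, n_matches: int, chance_rate: float) -> float:
    """P(at least n_matches of n_triggers agree by chance): binomial upper tail."""
    return sum(
        comb(n_triggers, k) * chance_rate**k * (1 - chance_rate) ** (n_triggers - k)
        for k in range(n_matches, n_triggers + 1)
    )

# Hypothetical audit: 20 secret triggers, the suspect model reproduces 18 of the
# expected wrong labels, and a 10-class task gives a chance rate of roughly 0.1.
p = watermark_p_value(n_triggers=20, n_matches=18, chance_rate=0.1)
print(f"p-value: {p:.3e}")
```

A very small p-value means the agreement is hard to explain by coincidence. It does not by itself rule out other explanations, such as both models sharing the same poisoned training data.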

Section 04

Membership inference: the shadow model attack

Membership inference determines whether a specific data record was used in a model's training set. The attack exploits a fundamental statistical property of most ML training: models tend to assign higher confidence to records they were trained on than to records from the same distribution they have not seen.

This is a consequence of overfitting: the model has specialised to the specific records in its training set, not just to the underlying data distribution. The difference in confidence between training members and non-members is the signal the attack exploits.

Shadow model attack (Shokri et al. 2017): step by step

1

Collect shadow training data

Attacker collects data from the same distribution as the target model's training data. Does not need the exact same records, just data from the same domain.

2

Train multiple shadow models

Train K shadow models on different subsets of the collected data. Each shadow model mimics the target model's behaviour on similar data.

K = typically 4 to 64 shadow models

3

Label records as member or non-member

For each shadow model, the attacker knows exactly which records were in training (members) and which were held out (non-members) because the attacker controlled the training.

4

Collect confidence scores for members and non-members

Query each shadow model with both member and non-member records. Record the confidence output for each. Members tend to produce higher confidence than non-members.

member confidence: ~0.82 avg, non-member confidence: ~0.64 avg (example)

5

Train binary attack classifier

Use the (confidence_score, member/non-member) pairs from all shadow models to train a binary classifier that predicts membership from confidence outputs.

6

Attack the real target model

Query the target model with a record of interest. Feed the confidence output to the attack classifier. Get a membership probability: was this record in the target's training set?

Output: P(record was in training set) = 0.76
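The six steps above can be sketched end to end in a deliberately exaggerated toy setting. Assumptions to note: every model is a logistic regression with far more features than training records, the labels are pure noise (so any confidence above chance is memorisation), and the "attack classifier" is reduced to a single confidence threshold instead of a trained binary classifier. None of this mirrors the original paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 500, 50   # feature dimension and records per training set (D >> N forces overfitting)

def sample_records(n):
    """Draw records from the shared data distribution (labels are random noise)."""
    return rng.normal(size=(n, D)), rng.integers(0, 2, size=n).astype(float)

def train(x, y, steps=200, lr=0.1):
    """Logistic regression by gradient descent; with D >> N it memorises its data."""
    w = np.zeros(D)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        w -= lr * x.T @ (p - y) / len(y)
    return w

def confidence(w, x, y):
    """Model's confidence in each record's true label."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return np.where(y == 1, p, 1.0 - p)

def member_nonmember_confidences():
    """Train one model; return its confidences on members and non-members."""
    x_in, y_in = sample_records(N)      # members: used for training
    x_out, y_out = sample_records(N)    # non-members: held out
    w = train(x_in, y_in)
    conf = np.concatenate([confidence(w, x_in, y_in), confidence(w, x_out, y_out)])
    member = np.concatenate([np.ones(N), np.zeros(N)]).astype(bool)
    return conf, member

# Steps 1-4: train K shadow models and collect labelled confidence scores.
K = 8
pairs = [member_nonmember_confidences() for _ in range(K)]
shadow_conf = np.concatenate([c for c, _ in pairs])
shadow_member = np.concatenate([m for _, m in pairs])

# Step 5: simplified attack classifier = threshold with the best shadow accuracy.
threshold = max(np.unique(shadow_conf),
                key=lambda t: np.mean((shadow_conf >= t) == shadow_member))

# Step 6: apply the learned threshold to a separately trained "target" model.
target_conf, target_member = member_nonmember_confidences()
attack_accuracy = np.mean((target_conf >= threshold) == target_member)
print(f"threshold={threshold:.3f}, attack accuracy on target={attack_accuracy:.2f}")
```

Real models generalise at least partly, so the member/non-member gap is much smaller than in this toy, which is why the attack classifier in practice is trained on the full confidence vector rather than a single threshold.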

LiRA vs shadow models: Carlini et al. 2022 improvement

Shadow model (Shokri 2017)

Membership inference from confidence patterns

Works without access to model architecture

Straightforward to implement

Does not account for natural variability in confidence scores across different records

Lower accuracy at low false positive rates (the regime that matters for privacy auditing)

TPR at 1% FPR: roughly 5 to 20% in typical settings

LiRA (Carlini et al. 2022)

Likelihood Ratio Attack with reference models

Uses many reference models to calibrate the membership signal for each specific record

Likelihood ratio test provides much better calibration across the FPR-TPR curve

Significantly more accurate at low false positive rates

Requires training many reference models (computationally expensive)

TPR at 1% FPR: roughly 30 to 60% for overfitted models
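The per-record calibration that distinguishes LiRA can be sketched as a Gaussian likelihood ratio on logit-scaled confidences. This is a simplified illustration: the function names and the example confidences are invented, and a real attack would obtain `in_confs` and `out_confs` by training many reference models with and without the record in question.

```python
import math
from statistics import mean, stdev

def logit(p: float, eps: float = 1e-6) -> float:
    """Map a confidence in (0, 1) to the real line so a Gaussian fit is reasonable."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def gaussian_log_pdf(x: float, mu: float, sigma: float) -> float:
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def lira_score(target_conf, in_confs, out_confs) -> float:
    """Log-likelihood ratio for one record: log p(conf | member) - log p(conf | non-member)."""
    x = logit(target_conf)
    ins = [logit(c) for c in in_confs]
    outs = [logit(c) for c in out_confs]
    return (gaussian_log_pdf(x, mean(ins), stdev(ins))
            - gaussian_log_pdf(x, mean(outs), stdev(outs)))

# Hypothetical confidences on ONE record from reference models trained
# with the record (in_confs) and without it (out_confs).
in_confs = [0.97, 0.95, 0.98, 0.96]
out_confs = [0.70, 0.62, 0.75, 0.66]

print(lira_score(0.96, in_confs, out_confs))  # positive: looks like a member
print(lira_score(0.68, in_confs, out_confs))  # negative: looks like a non-member
```

Because the two Gaussians are fitted separately for each record, a record that every model finds easy is not mistaken for a member just because its confidence is high.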

GDPR Article 17 compliance implication. The right to erasure requires that when a person requests deletion of their personal data, that data is removed. For ML systems, membership inference can provide evidence that a deleted record is still memorised in a deployed model. If your model can be shown to memorise training data, you cannot guarantee erasure simply by deleting the record from your training set. The technical response is either differential privacy during training (which bounds memorisation) or machine unlearning after the deletion request (section 08).

Section 05

Privacy auditing

Differential privacy training gives you a mathematical epsilon guarantee. But that guarantee is about the training algorithm, not the resulting model. Numerical errors, implementation bugs, or non-standard training procedures can produce models that leak more information than the formal epsilon implies. Privacy auditing provides an empirical lower bound on the actual information leakage of a trained model.

Canary auditing

Jagielski et al., 2020

Insert specially constructed worst-case examples (canaries) into the training data at known positions. After training, run LiRA or shadow model attacks against these canaries. The attack success rate on canaries gives an empirical lower bound on the true epsilon: if your epsilon-DP claim allows only X% membership inference accuracy, but attacks on canaries achieve Y% where Y > X, the training procedure is leaking more than claimed.

Canaries are designed to be worst-case: they are rare, distinctive examples that a training run is most likely to memorise. If the model does not memorise worst-case canaries, it is unlikely to memorise typical training examples.
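The link between attack success and epsilon can be made concrete. Under (ε, δ)-differential privacy, any membership test must satisfy TPR ≤ e^ε · FPR + δ and, symmetrically, 1 − FPR ≤ e^ε · (1 − TPR) + δ. Rearranging gives a lower bound on ε from measured rates. The sketch below uses point estimates only, which is an assumption made for brevity: a rigorous audit would replace TPR and FPR with confidence bounds that account for the finite number of canaries. The function name is illustrative.

```python
import math

def empirical_epsilon_lower_bound(tpr: float, fpr: float, delta: float = 0.0) -> float:
    """Largest epsilon lower bound implied by a membership attack's TPR and FPR."""
    candidates = [0.0]
    # From TPR <= exp(eps) * FPR + delta
    if fpr > 0 and tpr - delta > 0:
        candidates.append(math.log((tpr - delta) / fpr))
    # From (1 - FPR) <= exp(eps) * (1 - TPR) + delta
    fnr = 1.0 - tpr
    if fnr > 0 and 1.0 - delta - fpr > 0:
        candidates.append(math.log((1.0 - delta - fpr) / fnr))
    return max(candidates)

# Hypothetical audit result: the attack flags 50% of canaries at a 5% false positive rate.
print(empirical_epsilon_lower_bound(tpr=0.5, fpr=0.05))  # ln(10), about 2.30
```

If the training run claimed ε = 1 and the bound comes out near 2.3, the claim and the measurement are inconsistent and the implementation deserves scrutiny.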

Secret Sharer memorisation

Carlini et al., 2019

Insert unique secret text sequences (for example "The secret password is XKCD-7291-alpha") at varying frequencies (1x, 2x, 5x, 10x, 50x repetitions). After training, measure how much more likely the model is to predict each secret versus equivalent random sequences. Plot memorisation probability as a function of exposure frequency.

A model trained with effective DP shows low memorisation probability at low exposure frequencies, with memorisation rising sharply only once a secret is repeated many times. A model without DP shows high memorisation even at low frequency, indicating verbatim text memorisation.
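The comparison against "equivalent random sequences" is commonly summarised with a rank-based exposure score: rank the inserted secret among all candidate secrets of the same format by how likely the model finds them, then take log2 of the candidate count minus log2 of the rank. The sketch below assumes you already have a log-perplexity for the canary and for each alternative candidate; the function name and the synthetic numbers are illustrative.

```python
import math

def exposure(canary_log_perplexity: float,
             candidate_log_perplexities: list[float]) -> float:
    """Exposure = log2(number of candidates) - log2(rank of the canary).

    Rank 1 means the model finds the inserted canary more likely (lower
    perplexity) than every alternative candidate of the same format.
    """
    total = len(candidate_log_perplexities) + 1          # alternatives plus the canary
    rank = 1 + sum(1 for s in candidate_log_perplexities if s < canary_log_perplexity)
    return math.log2(total) - math.log2(rank)

# 1023 hypothetical alternatives with log-perplexities between 10.00 and 20.22.
candidates = [10 + 0.01 * i for i in range(1023)]

print(exposure(5.0, candidates))      # 10.0: canary ranked first, fully exposed
print(exposure(15.115, candidates))   # just under 1: canary is mid-pack, little memorisation
```

Plotting this score against the number of times each secret was inserted gives the memorisation-versus-exposure-frequency curve described above.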

Python · Basic canary insertion for privacy auditing

# Canary auditing: insert known records and measure memorisation
import numpy as np

# Step 1: Create canary examples (worst-case: rare, distinctive)
canaries = [
    {"text": "SECRET_CANARY_A: xK7pQ2mR9nL4", "label": 0},   # in training
    {"text": "SECRET_CANARY_B: tW3vY8sH5jF6", "label": 1},   # in training
]
non_members = [
    {"text": "SECRET_CANARY_C: bN1qU6wE0cI9", "label": 0},   # NOT in training
]

# Step 2: Train `model` with the canaries inserted into its training data,
# and train each of `reference_models` on data that contains no canaries.
# ... (normal training procedure)

# Step 3: Measure memorisation after training
def membership_score(model, reference_models, record: dict) -> float:
    # Simplified LiRA-style calibration: a confidence ratio, not a full likelihood ratio test.
    # Assumes each model exposes a scikit-learn-style predict_proba that accepts
    # a list of texts and returns an array of shape (n_texts, n_classes).
    text, label = record["text"], record["label"]
    confidence = model.predict_proba([text])[0][label]
    ref_confidences = [ref.predict_proba([text])[0][label] for ref in reference_models]
    # How much more confident is the target than models that never saw the record?
    return float(confidence / max(np.mean(ref_confidences), 1e-12))

# Compare member canaries vs non-member canaries
def audit(model, reference_models):
    member_scores = [membership_score(model, reference_models, c) for c in canaries]
    nonmember_scores = [membership_score(model, reference_models, c) for c in non_members]
    return member_scores, nonmember_scores
# If member_scores >> nonmember_scores: the model memorises canaries
# = evidence that it leaks more than the claimed epsilon should allow

Deduplication reduces memorisation risk. Kandpal et al. 2022 showed that language models memorise repeated sequences far more than unique ones. A sequence that appears 100 times in training is dramatically more likely to be memorised verbatim than a sequence that appears once. Deduplicating training data before training reduces the memorisation risk of any individual record and improves model quality by removing redundant training signal. For any model trained on web-crawled data, deduplication should be applied before training.
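A minimal sketch of exact-match deduplication with an optional frequency cap is shown below. It only catches records that are identical after lowercasing and whitespace normalisation, which is an assumption made for simplicity; near-duplicate detection needs fuzzier matching and is not covered here. The function names are illustrative.

```python
import hashlib
from collections import Counter

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(records: list[str], max_copies: int = 1) -> list[str]:
    """Keep at most max_copies of each normalised record, preserving order."""
    seen = Counter()
    kept = []
    for text in records:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if seen[digest] < max_copies:
            kept.append(text)
        seen[digest] += 1
    return kept

corpus = [
    "The secret is 1234.",
    "the  secret is 1234.",    # same record after normalisation
    "Unrelated sentence.",
    "The secret is 1234.",
]
print(deduplicate(corpus))
# ['The secret is 1234.', 'Unrelated sentence.']
```

Setting `max_copies` above 1 gives a frequency cap rather than strict deduplication, which can be useful when some repetition is legitimate.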

Section 06

Model inversion: reconstructing training data

Fredrikson, Jha, and Ristenpart demonstrated model inversion attacks in 2015. The core observation: a classification model's output for a target class reflects what training data in that class looks like. By working backwards from the output, an attacker can reconstruct approximate inputs that the model associates strongly with a target class.

The mechanism is gradient ascent on the input space: start with a random input, compute the gradient of the model's target-class confidence with respect to the input, and update the input in the direction that increases confidence. After many iterations, the input converges to a representation of what the model has learned the target class looks like.

Model inversion via gradient ascent (Fredrikson et al. 2015)

Start

Random input x₀

Random noise. Model confidence in target class: ~12%

→

Step t

Gradient step

x_{t+1} = x_t + α · ∇_x P(target | x_t). Confidence rises.

→

After K steps

Converged x*

Input that maximises target confidence. Approximates training class features.

→

Result

Reconstructed x*

Model confidence: ~87%. Approximate training data pattern recovered.
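The update rule in the diagram can be sketched for a linear-softmax classifier, where the gradient of the target-class confidence with respect to the input has a closed form. The random weights below are a stand-in for a trained model, and computing the gradient analytically assumes the attacker can differentiate through the model; the function name and hyperparameters are illustrative.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def invert_class(W, b, target, steps=1000, alpha=0.05, seed=0):
    """Gradient ascent on the input to maximise P(target | x) for softmax(W x + b)."""
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.normal(size=W.shape[1])           # x_0: random starting input
    history = []
    for _ in range(steps):
        p = softmax(W @ x + b)
        history.append(p[target])
        grad = p[target] * (W[target] - p @ W)      # d P(target | x) / dx
        x = x + alpha * grad                        # x_{t+1} = x_t + alpha * grad
    history.append(softmax(W @ x + b)[target])
    return x, history

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 10))    # hypothetical stand-in for a trained 3-class model
b = np.zeros(3)
x_star, history = invert_class(W, b, target=2)
print(f"target-class confidence: {history[0]:.2f} -> {history[-1]:.2f}")
```

For a linear model the recovered `x_star` simply points along the direction that separates the target class from the others; for a deep network the same loop produces the blurry class prototypes described in this section.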

Fredrikson and colleagues first demonstrated model inversion on a pharmacogenetics model in 2014 (recovering approximate genotype features); the 2015 paper extended the attack to a facial recognition model (recovering approximate face images for each recognised individual). The reconstructions are approximate representations, not exact copies of training examples.

Model inversion is most threatening when the training data is highly sensitive: medical images, private facial photographs, financial records, or personalised content. The attack does not require any data from the target class beyond a working API: the gradient ascent can start from pure noise.

Standard gradient ascent inversion produces blurry, low-quality reconstructions because there is no constraint keeping the optimised input in the space of realistic inputs. The GAN-based approach in section 07 addresses this directly.

Section 07

GAN-based model inversion

The limitation of gradient ascent inversion is that it explores the entire input space without any prior knowledge about what realistic inputs look like. An image that achieves 95% confidence for a face recognition model may not look like a face at all: it may be a high-confidence adversarial example that is visually nonsensical.

Zhang et al. addressed this in 2020 in The Secret Revealer by adding a GAN-based prior to constrain the inversion to realistic inputs. The insight: if you first train a GAN on a public dataset from the same domain, the GAN's latent space maps noise vectors to realistic-looking images. Doing gradient ascent in the latent space rather than pixel space means every step of the optimisation produces a realistic image.

Gradient ascent in pixel space

Simple: no additional model needed

Works for any differentiable model

Produces blurry, noisy reconstructions

No constraint on what "realistic" means

Often converges to adversarial-looking inputs rather than training-data-like inputs

Fredrikson et al. 2015 used this approach

GAN-based inversion (Secret Revealer)

Train GAN on public domain data (for example, public face dataset)

Gradient ascent runs in GAN latent space, not pixel space

Every optimisation step produces a realistic-looking image

Much sharper and more realistic reconstructions

Recovers identity-recognisable features from face models

Requires a GAN trained on public data from the same domain
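The difference between the two columns above comes down to which variable is optimised. The toy sketch below makes that concrete under heavy assumptions: the "generator" is a fixed linear map standing in for a trained GAN, the classifier is linear-softmax with random weights, and any extra realism terms a full attack may use are omitted. The point is only the chain rule: the gradient with respect to the latent vector z is the generator's Jacobian transposed times the gradient with respect to the image.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
INPUT_DIM, LATENT_DIM, CLASSES = 20, 3, 4

# Hypothetical stand-ins: a real attack would use a trained GAN generator
# and the actual target classifier instead of random matrices.
A = rng.normal(size=(INPUT_DIM, LATENT_DIM)) / np.sqrt(INPUT_DIM)
W = rng.normal(size=(CLASSES, INPUT_DIM))

def generator(z: np.ndarray) -> np.ndarray:
    """Toy linear 'generator': every output lies in a 3-dimensional subspace."""
    return A @ z

def invert_in_latent_space(target: int, steps: int = 300, alpha: float = 0.1) -> np.ndarray:
    z = np.zeros(LATENT_DIM)
    for _ in range(steps):
        x = generator(z)
        p = softmax(W @ x)
        grad_x = p[target] * (W[target] - p @ W)   # dP(target | x)/dx
        grad_z = A.T @ grad_x                      # chain rule through the generator
        z = z + alpha * grad_z                     # the update happens in latent space
    return z

target = 1
z_star = invert_in_latent_space(target)
x_star = generator(z_star)
start_conf = softmax(W @ generator(np.zeros(LATENT_DIM)))[target]
final_conf = softmax(W @ x_star)[target]
print(f"target-class confidence: {start_conf:.2f} -> {final_conf:.2f}")
```

Because `x_star` is always `generator(z_star)`, the reconstruction can never leave the generator's output space, which is the constraint that keeps GAN-based reconstructions realistic.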

Implication for medical models. A face recognition model trained on private patient photographs could be subject to GAN-based inversion using a publicly available face GAN. The resulting reconstructions could be recognisable approximations of the patients whose faces were used in training. Models trained on sensitive medical imaging data face the same risk if a GAN can be trained on public medical images from the same modality. Differential privacy during training is the primary technical mitigation: it limits how much any individual training example can influence the model output, which directly limits the quality of model inversion reconstructions.

Section 08

Machine unlearning

GDPR Article 17 establishes the right to erasure: individuals can request that their personal data be deleted, and the controller must comply. For ML systems, deleting the record from the training set database is not sufficient if the model was already trained on that data. Membership inference can demonstrate that the model still memorises information about the deleted record. Machine unlearning provides a way to remove that information from the model itself.

Cao and Yang formally defined machine unlearning in 2015. The field has since produced three broad approaches, each with a different tradeoff between cost, quality, and formal guarantee.

Exact

Exact unlearning: retrain from scratch

Remove the deleted records from the training dataset. Retrain the model from scratch on the remaining data. The resulting model is mathematically equivalent to a model that was never trained on the deleted records. The cleanest possible unlearning guarantee.

Cost: full retraining every time · Guarantee: perfect

Approx

Approximate unlearning: Newton step update

Compute the Newton update that approximates the effect of removing the deleted records from the training gradient. Update the model weights with this step without retraining from scratch. Certified Data Removal (Guo et al. 2020) provides a formal indistinguishability guarantee if certain conditions on the loss landscape hold.

Cost: much cheaper than retraining · Guarantee: conditional formal bound
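The Newton-step idea is easiest to see on ridge regression, where the loss is quadratic and a single Newton step reproduces retraining exactly. This is a sketch under that assumption, with a fixed regularisation strength `lam` and illustrative function names; for non-quadratic losses the same step is only an approximation, which is where the formal conditions mentioned above come in.

```python
import numpy as np

def fit_ridge(x, y, lam):
    """Minimise 0.5 * sum((x w - y)^2) + 0.5 * lam * ||w||^2 in closed form."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def newton_unlearn(w, x, y, remove_idx, lam):
    """Remove one record's influence with a single Newton step (no retraining)."""
    x_i, y_i = x[remove_idx], y[remove_idx]
    x_rest = np.delete(x, remove_idx, axis=0)
    grad_i = (x_i @ w - y_i) * x_i                               # removed record's loss gradient at w
    hessian_rest = x_rest.T @ x_rest + lam * np.eye(x.shape[1])  # Hessian of the remaining objective
    return w + np.linalg.solve(hessian_rest, grad_i)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = x @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
lam = 1.0

w_full = fit_ridge(x, y, lam)
w_unlearned = newton_unlearn(w_full, x, y, remove_idx=17, lam=lam)
w_retrained = fit_ridge(np.delete(x, 17, axis=0), np.delete(y, 17), lam)
print(np.max(np.abs(w_unlearned - w_retrained)))   # numerically zero
```

The step works because, at the original optimum, the gradient of the remaining objective equals minus the removed record's gradient, so one Newton step on the remaining objective lands on its minimiser.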

SISA

SISA training: sharded isolated sliced aggregated

Bourtoule et al. 2021: split training data into S equal shards. Train a separate model component on each shard in isolation. Aggregate all components into the final model. When an unlearning request arrives, retrain only the shard containing the deleted record. Cost is O(N/S), roughly 1/S of a full retrain.

Cost: O(N/S), about 1/S of a full retrain, where S = number of shards · Guarantee: exact for the affected shard

SISA training: how sharding reduces unlearning cost (Bourtoule et al. 2021)

Without SISA: standard training

All N training examples

↓

One model trained on all data

↓ deletion request ↓

Retrain entire model from scratch

Cost: O(N) training time

With SISA: sharded training (S=4)

Shard 1
N/4

Shard 2
N/4

Shard 3
N/4 ⚠

Shard 4
N/4

↓ deletion in Shard 3 ↓

Retrain only Shard 3 component

Cost: O(N/S) = O(N/4) training time
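The sharding logic in the diagram can be sketched with a deliberately simple per-shard model. Assumptions: each shard trains a nearest-centroid classifier, aggregation is a majority vote, and the slicing and checkpointing part of SISA is omitted for brevity. The class name `SisaEnsemble` and the `retrain_count` counter are illustrative additions for the example.

```python
import numpy as np

class SisaEnsemble:
    """Sharded, isolated training with majority-vote aggregation (no slicing)."""

    def __init__(self, x, y, num_shards, seed=0):
        rng = np.random.default_rng(seed)
        self.x, self.y = x, y
        self.active = np.ones(len(x), dtype=bool)               # False once unlearned
        self.shard_of = rng.permutation(len(x)) % num_shards    # fixed record -> shard map
        self.retrain_count = 0
        self.models = [self._train_shard(s) for s in range(num_shards)]

    def _train_shard(self, shard):
        """Train one isolated component on this shard's remaining records only."""
        self.retrain_count += 1
        mask = (self.shard_of == shard) & self.active
        classes = np.unique(self.y[mask])
        centroids = np.stack([self.x[mask & (self.y == c)].mean(axis=0) for c in classes])
        return classes, centroids

    def predict(self, point):
        """Aggregate the shard models by majority vote."""
        votes = []
        for classes, centroids in self.models:
            nearest = np.argmin(np.linalg.norm(centroids - point, axis=1))
            votes.append(classes[nearest])
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]

    def unlearn(self, record_idx):
        """Drop one record and retrain only the shard that contained it."""
        self.active[record_idx] = False
        shard = self.shard_of[record_idx]
        self.models[shard] = self._train_shard(shard)
        return shard

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-2, 1, size=(200, 2)), rng.normal(2, 1, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

ensemble = SisaEnsemble(x, y, num_shards=4)
before = ensemble.retrain_count     # 4: one training run per shard
ensemble.unlearn(record_idx=17)
after = ensemble.retrain_count      # 5: only the affected shard was retrained
print(before, after, ensemble.predict(np.array([2.0, 2.0])))
```

The other three shard models are untouched by the deletion, which is where the 1/S cost saving comes from; they never saw the deleted record, so the guarantee for them is trivially exact.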

Machine unlearning for LLMs is an open problem. SISA and exact retraining are feasible for models with tens of millions of parameters. For LLMs with billions of parameters, even retraining one shard is extremely expensive. Approximate unlearning for LLMs is an active research area. Current practical approaches for LLM compliance rely on differential privacy during pre-training, which bounds memorisation in the first place, together with documentation that the formal epsilon bound means the deleted record had bounded influence from the start.

Section 09

Privacy compliance for ML systems

Before deploying any ML model trained on personal data, verify the following controls against the attacks covered in this module and the GDPR obligations they implicate.

GDPR obligations and the ML attacks they implicate

GDPR obligation	ML attack it implicates	Technical response	Module
Art. 5: Data minimisation	Memorisation and model inversion: model learns more personal detail than needed for the task	Training deduplication, differential privacy, minimum necessary data principle in training set design	C5 this module
Art. 17: Right to erasure	Membership inference: deleted data may still be memorised in deployed model	Exact unlearning, SISA training, or DP pre-training as documented guarantee	C5 this module
Art. 32: Security of processing	Model inversion: training data reconstructable from model outputs. Model extraction: model stolen via API	Differential privacy during training, watermarking, rate limiting, output perturbation	C5 this module
Art. 35: DPIA requirement	Large-scale processing of personal data with ML: systematic assessment required when high risk	Privacy audit (canary testing, memorisation measurement) as part of DPIA evidence	C5 this module
Data poisoning prevention	Training data integrity: injected records influence model and may encode attacker-chosen memorisation	Data provenance, spectral inspection, DP-SGD to limit per-example influence	C2 Data Poisoning
Supply chain integrity	Model weight substitution or dependency attack may introduce new memorisation or exfiltration	SBOM, cryptographic hash verification, safetensors format	C4 Supply Chain

Memorisation and extraction risk

Training data is deduplicated before training. Repeated sequences that drive memorisation are identified and removed or frequency-capped.

Privacy audit is run post-training: canary auditing or Secret Sharer memorisation measurement to quantify actual information leakage.

For models trained on sensitive personal data: differential privacy (DP-SGD) is applied during training with a documented epsilon value.

Model extraction is monitored: unusual API query patterns (systematic coverage, high volume from single keys) are flagged and investigated.

GDPR right to erasure

A documented unlearning procedure exists before personal data is ingested into training. The procedure is one of: exact unlearning (retrain), SISA training, or certified approximate unlearning.

Training data is stored with individual record identifiers so specific records can be identified and removed on deletion request.

After unlearning, a membership inference audit verifies that the deleted records no longer show elevated membership scores.

For DP-trained models: the epsilon value and its privacy guarantee are documented as part of the deletion response to demonstrate compliance.

Model inversion defences

Models trained on personal images, medical records, or other sensitive data have differential privacy applied. DP limits inversion reconstruction quality.

Confidence outputs are evaluated for inversion risk: models that return very high confidence on target classes are more vulnerable to inversion. Output rounding is applied where compatible with legitimate use.

The model's training data domain is documented: if a public GAN can be trained on similar data, GAN-based inversion is a realistic threat and stronger DP or access controls are warranted.

DPIA and compliance evidence

A Data Protection Impact Assessment (DPIA) is completed before training on large-scale personal data. ML-specific risks including memorisation, inversion, and extraction are assessed in the DPIA.

Privacy audit results are retained as documentary evidence of the model's privacy properties for regulatory review.

If a watermark is embedded for extraction detection, the trigger set and expected outputs are stored securely and are available for legal proceedings.
