Module D1 of 5 · Track 3D: Privacy-Preserving AI

The model remembers more than you think.

Why AI Privacy Differs

A database breach leaks records. An AI system leaks information through its answers. Traditional privacy controls were built for the first problem. This module covers the four attack classes that exploit the second, and why the fixes are fundamentally different.

36 min read
Track 3D
Intermediate
AML.T0024

Section 01

The core problem

When a database is breached, the attacker reads records directly. The breach is the attack. When an AI system is trained on sensitive data and deployed, the data never needs to be breached. The model itself encodes statistical information about every training record, and an attacker who can query the model can extract that information through normal use.

This is a fundamentally different threat model. The attacker does not need a network intrusion. They need a chat interface or an API endpoint. They do not need to steal data. They need to ask the right questions.

The model is not the data, but it is derived from the data, and that derivation is not one-way. Researchers have shown repeatedly that training data can be partially reconstructed from model weights and outputs. The degree of reconstruction depends on how much the model memorised versus generalised during training, but memorisation is a default behaviour of large models, not a bug introduced by careless implementation.

Traditional data privacy threat:
Attacker gains direct access to storage or network
Breach is a single event that can be detected
Access controls stop unauthorised reads
Anonymisation removes direct identifiers
Encryption at rest prevents data extraction

AI inference privacy threat:
Attacker queries the model through its normal interface
Leakage is gradual and looks like normal usage
Authorised users can run inference attacks
Models memorise patterns that enable re-identification
Model processes data in plaintext during training

MITRE ATLAS AML.T0024 Exfiltration via ML Inference API covers the technique of using a model's inference interface to extract information about training data. This module covers the four concrete attack classes that implement this technique.

Section 02

Four attack classes

Researchers have identified four distinct classes of attack that extract private information from AI models. They differ in what they target, how much access they require, and what kind of information they recover. All four can be executed by an attacker with only black-box API access unless specific defences are in place.

🔍
Membership inference
Individual record level
Determine whether a specific data record was in the training set. Works by observing that models produce higher confidence on records they memorised than on unseen records.
📷
Model inversion and reconstruction
Training data recovery
Recover approximate training data from model weights or outputs. Examples range from reconstructing faces from facial recognition models to extracting verbatim names and addresses from LLMs.
👤
Attribute inference
Sensitive attribute level
Infer sensitive attributes about individuals not explicitly asked about. Models encode correlations between seemingly neutral inputs and sensitive personal characteristics.
📊
Property inference
Dataset statistics level
Recover global properties of the training dataset, such as demographic proportions, without accessing individual records. Works even with only black-box model access.

Section 03

Membership inference

A membership inference attack answers a single question: was this specific record in the training set? This matters because knowing that a record was used to train a model can reveal sensitive information. If a model was trained on medical records of cancer patients, and you can confirm that a specific individual's record was in the training set, you have learned that individual has cancer without accessing any medical database.

The mechanism exploits a consistent property of overfit models: they assign higher confidence to training examples than to unseen examples. A model that has memorised a training record will produce a more confident prediction when it sees that record again compared to a similar but previously unseen record.

Figure: Membership inference, the core observation. Target record (patient: age 52, diagnosis: T2DM) → query model → observe confidence scores → compare to threshold. High confidence suggests a member; low confidence suggests a non-member.
Confidence 0.94 → Likely in training set
Confidence 0.61 → Likely not in training set
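
The threshold comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real attack: `make_model` is a hypothetical stand-in for an overfit model (it is most confident on points it has already seen), and the 0.99 threshold and the synthetic records are choices made for this toy setup only.

```python
import math
import random

def make_model(training_records):
    """Hypothetical overfit model: confidence decays with distance to the nearest training record."""
    def confidence(record):
        nearest = min(math.dist(record, seen) for seen in training_records)
        return math.exp(-nearest)  # exactly 1.0 for a memorised record
    return confidence

def infer_membership(confidence_fn, record, threshold=0.99):
    """Threshold attack: guess 'member' when the model is unusually confident."""
    return confidence_fn(record) >= threshold

rng = random.Random(0)
members = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
non_members = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# The attacker only ever calls target_model; it never reads `members` directly.
target_model = make_model(members)

hits = sum(infer_membership(target_model, r) for r in members)
false_alarms = sum(infer_membership(target_model, r) for r in non_members)
accuracy = (hits + len(non_members) - false_alarms) / (len(members) + len(non_members))
print(f"members flagged: {hits}, non-members flagged: {false_alarms}, accuracy: {accuracy:.2f}")
```

The toy model memorises perfectly, so the attack looks far stronger here than it would against a regularised model, where the member and non-member confidence distributions overlap.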

Shokri, Stronati, Song, and Shmatikov (2017) formalised this into a practical attack using shadow models: train multiple models on similar data, observe how confidence scores differ between training and test examples, then use this pattern to classify membership in the target model. On overfit models this achieves 75 to 90 percent accuracy.

Carlini, Chien, Nasr, Song, Terzis, and Tramer (2022) introduced LiRA (Likelihood Ratio Attack), which is more precise. Rather than comparing confidence to a fixed threshold, LiRA trains shadow models both with and without the target record, then uses the difference in output distributions as a likelihood ratio. This works even against well-regularised models with much lower overfitting, producing reliable attacks where simpler threshold approaches fail.
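
The likelihood ratio step can be sketched as follows. This is a simplified illustration, assuming the shadow models have already been trained and queried: the `in_confs` and `out_confs` lists are made-up confidences on the target record from shadow models trained with and without it, and the function names are hypothetical. Confidences are logit-scaled and each group is modelled as a Gaussian, which is the general shape of the LiRA test.

```python
import math
from statistics import mean, stdev

def logit(p, eps=1e-6):
    """Logit-scale a confidence so its distribution is closer to Gaussian."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x, computed in log space to avoid underflow."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def lira_log_ratio(observed_conf, in_confs, out_confs):
    """Log likelihood ratio of 'member' versus 'non-member' for one target record."""
    in_scores = [logit(c) for c in in_confs]
    out_scores = [logit(c) for c in out_confs]
    x = logit(observed_conf)
    return (gaussian_log_pdf(x, mean(in_scores), stdev(in_scores))
            - gaussian_log_pdf(x, mean(out_scores), stdev(out_scores)))

# Made-up shadow-model confidences on the target record (assumption for illustration).
in_confs = [0.93, 0.95, 0.91, 0.96, 0.94]    # shadow models trained WITH the record
out_confs = [0.60, 0.66, 0.58, 0.63, 0.70]   # shadow models trained WITHOUT it

print(lira_log_ratio(0.94, in_confs, out_confs))  # positive: evidence for membership
print(lira_log_ratio(0.61, in_confs, out_confs))  # negative: evidence against membership
```

Because the ratio is calibrated per record, a record that every model finds easy does not automatically look like a member, which is what a single global threshold gets wrong.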

Attack accuracy by model type (AUC-ROC)
Heavily overfit model: 0.88
Moderately overfit model: 0.75
Well-regularised model: 0.62
Differentially private model: 0.52
AUC-ROC of 0.5 = random guessing. Approximate figures based on Shokri et al. 2017 and Carlini et al. 2022.

2017 Shokri, Stronati, Song, Shmatikov, "Membership Inference Attacks Against Machine Learning Models"
2022 Carlini et al., "Membership Inference Attacks From First Principles" (LiRA)
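
For readers who want to see how an AUC-ROC figure is derived from attack scores, here is a small sketch. It uses the rank interpretation of AUC: the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member. The score lists are invented for illustration and do not come from the cited papers.

```python
def auc_roc(member_scores, non_member_scores):
    """AUC as the fraction of (member, non-member) pairs the attack orders correctly."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half
    return wins / (len(member_scores) * len(non_member_scores))

# Invented attack scores (for example, model confidences on each record).
member_scores = [0.94, 0.80, 0.55]
non_member_scores = [0.61, 0.70, 0.50]

print(round(auc_roc(member_scores, non_member_scores), 3))
print(auc_roc([0.9, 0.5], [0.9, 0.5]))  # identical distributions give 0.5
```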

Section 04

Model inversion and data reconstruction

Model inversion goes further than membership inference. Instead of asking whether a specific record was in the training set, it tries to recover what that record looks like. The attacker uses the model's own outputs as a signal to reconstruct approximate versions of training data.

The intuition: if a model predicts "likely diabetic" with 94% confidence for a specific combination of inputs, those inputs reveal information about the diabetic patients in the training set. By iteratively adjusting inputs to maximise confidence in a target class, an attacker can recover the statistical centre of each class in the training data.

Model inversion: Fredrikson et al. 2015 method
Step 1: Start with random pixels
Step 2: Query model, get confidence
Step 3: Adjust pixels to raise confidence
Result: Reconstructed face image

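Fredrikson et al. 2015 used this method against a facial recognition model. Given only a person's name (the class label) and the confidence scores the model returned, they reconstructed a recognisable image of that person's face. The same line of work began a year earlier with a pharmacogenetics model for warfarin dosing (Fredrikson et al. 2014), where the attack inferred patients' genetic markers from the model's predictions combined with demographic information.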
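
The query-and-adjust loop can be sketched as a simple hill climb. This is a toy under stated assumptions, not the method from the paper: `target_confidence` is a hypothetical black-box model whose confidence peaks at a hidden class mean, the input has four features rather than image pixels, and random perturbation stands in for the gradient-based optimisation used in the original work.

```python
import math
import random

# Stands in for what the model learned about the target class. The attacker never reads this.
HIDDEN_CLASS_MEAN = [0.2, 0.8, 0.5, 0.1]

def target_confidence(x):
    """Hypothetical black-box model: confidence falls off with distance from the class mean."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, HIDDEN_CLASS_MEAN)))

def invert(confidence_fn, dims, steps=5000, step_size=0.05, seed=0):
    """Start from random input, keep any random adjustment that raises confidence."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(dims)]
    best = confidence_fn(x)
    for _ in range(steps):
        candidate = [v + rng.gauss(0, step_size) for v in x]
        score = confidence_fn(candidate)
        if score > best:
            x, best = candidate, score
    return x, best

reconstruction, confidence = invert(target_confidence, dims=4)
print([round(v, 2) for v in reconstruction], round(confidence, 3))
```

The attacker recovers something close to the class average, not an exact training record, which matches the "statistical centre" intuition described above.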

The more serious demonstration for modern AI came from Carlini, Tramer, Wallace, Jagielski, Herbert-Voss, Lee, Roberts, Brown, Song, Erlingsson, Oprea, and Raffel (2021), who showed that large language models memorise and regurgitate verbatim training sequences. Their method: generate a large number of text samples from GPT-2, then use a membership inference test to identify which samples were memorised from training data.

What they found: GPT-2 had memorised names, phone numbers, email addresses, physical addresses, social media handles, and other personally identifying information from the scraped web pages in its training set (WebText, built from pages linked on Reddit). This information could be extracted by any user with access to the model, without any special access to the training data.
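
The generate-then-rank idea can be illustrated with a toy. The sketch below is an assumption-heavy stand-in: a character n-gram model plays the role of the language model, the training text and the contact details in it are invented, and candidates are ranked by the ratio of model log-perplexity to zlib-compressed size, loosely following one of the ranking ideas in the paper. Text the model finds far easier than a generic compressor does is flagged as likely memorised.

```python
import math
import zlib
from collections import Counter

ORDER = 4      # characters of context used by the toy model
ALPHA = 0.01   # smoothing constant (assumption)
VOCAB = 128    # assumed alphabet size

def train_char_model(text):
    """Count how often each character follows each ORDER-character context."""
    contexts, continuations = Counter(), Counter()
    for i in range(len(text) - ORDER):
        ctx, nxt = text[i:i + ORDER], text[i + ORDER]
        contexts[ctx] += 1
        continuations[(ctx, nxt)] += 1
    return contexts, continuations

def log_perplexity(model, sample):
    """Mean negative log-likelihood per character under the toy model."""
    contexts, continuations = model
    steps = len(sample) - ORDER
    total = 0.0
    for i in range(steps):
        ctx, nxt = sample[i:i + ORDER], sample[i + ORDER]
        p = (continuations[(ctx, nxt)] + ALPHA) / (contexts[ctx] + ALPHA * VOCAB)
        total -= math.log(p)
    return total / steps

def zlib_entropy(sample):
    """Compressed size in bytes, a model-free measure of how surprising the text is."""
    return len(zlib.compress(sample.encode("utf-8")))

def rank_by_memorisation(model, samples):
    """Most suspicious first: low model perplexity relative to zlib entropy."""
    return sorted(samples, key=lambda s: log_perplexity(model, s) / zlib_entropy(s))

# Invented training text containing one piece of contact information.
training_text = (
    "Quarterly results were strong across all regions. "
    "Contact Jane Roe at jane.roe@example.com for details. "
    "The committee meets on the first Monday of each month. "
)
samples = [
    "The weather tomorrow should be mild with some cloud.",
    "Contact Jane Roe at jane.roe@example.com for details.",
    "Several new results about regions were announced by post.",
]

model = train_char_model(training_text)
ranked = rank_by_memorisation(model, samples)
print(ranked[0])
```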

This is not a theoretical concern. GPT-2 was a 2019 model with 1.5 billion parameters. Carlini et al. showed that larger models memorise more, not less, because they have greater capacity to store training patterns. This property applies to every large language model trained on internet-scale data, including models deployed in commercial products today.

2015 Fredrikson, Jha, Ristenpart, "Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures"
2021 Carlini et al., "Extracting Training Data from Large Language Models"

Section 05

Attribute inference

Attribute inference does not try to recover training data. It uses the model to infer sensitive attributes about users at inference time. A user asks a question. The model's answer, or the patterns in how the model responds to that user's questions, reveal sensitive information about that user that they never disclosed.

This happens because models are trained on data that contains correlations between language patterns and sensitive attributes. A model trained on internet text has absorbed statistical associations between word choice, sentence structure, topic selection, and demographic, health, and political characteristics. These correlations persist in the model and can be exploited by an attacker who designs queries to probe them.

Staab, Vero, Balunović, and Vechev (2023) demonstrated this kind of inference with commercial language models. Given only text that people had written online, the models inferred personal attributes such as location, income, and sex with meaningful accuracy. The same mechanism applies across conversation turns in a deployed assistant: political affiliation can be inferred from topic selection and framing, health conditions from seemingly neutral question topics, and financial status from vocabulary and question framing. The user never disclosed any of these attributes. The model inferred them from how the user wrote and what they asked about.

📅
Political affiliation
Inferred from topic selection, framing of political questions, and vocabulary patterns across conversation turns.
💊
Health conditions
Inferred from question topics, symptom descriptions, and medication mentions even in non-medical conversations.
💵
Financial status
Inferred from vocabulary complexity, geographic references, brand mentions, and spending-related question patterns.

Why this matters for RAG systems. In a RAG deployment, user queries drive document retrieval. An attacker who can observe query patterns across users can infer sensitive attributes about those users from what they search for, even if the query content itself appears innocuous. This is an attack on user privacy through the retrieval layer, not through the documents themselves.

2023 Staab, Vero, Balunović, Vechev, "Beyond Memorization: Violating Privacy via Inference with Large Language Models"

Section 06

Property inference

Property inference targets the training dataset as a whole, not individual records. The question is not whether a specific person was in the training set, but what the training set looked like statistically. What fraction of training examples had a specific characteristic? What demographic groups are overrepresented? What sensitive categories appear in the training data?

Ateniese, Mancini, Spognardi, Villani, Vitali, and Felici (2015) showed that an adversary with access to a trained classifier can recover dataset-level properties. Their approach: train shadow classifiers on datasets with and without a target property, then train a meta-classifier on those classifiers' learned parameters to recognise the property in the target model. Later work extended the idea to black-box settings, where the attacker observes differences in model behaviour on crafted inputs and uses them to estimate the proportion of the property in the target model's training set.
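
A black-box version of the shadow-model idea can be sketched with a toy. Everything here is an assumption made for illustration: `train_model` is a hypothetical one-feature model, records with the sensitive property are drawn from a shifted distribution, and the attacker fits a straight line from observed behaviour to known proportions instead of training a full meta-classifier.

```python
import math
import random

def sample_dataset(proportion, size, rng):
    """Synthetic records: those with the sensitive property have a shifted feature."""
    return [rng.gauss(1.0 if rng.random() < proportion else 0.0, 1.0) for _ in range(size)]

def train_model(records):
    """Hypothetical model: a one-feature scorer centred on the training mean."""
    centre = sum(records) / len(records)
    return lambda x: 1.0 / (1.0 + math.exp(-(x - centre)))

def behaviour(model, probes=(-1.0, 0.0, 0.5, 1.0, 2.0)):
    """Black-box observation: average score on a fixed set of probe inputs."""
    return sum(model(x) for x in probes) / len(probes)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Attacker side: shadow models trained on auxiliary data with known proportions.
rng = random.Random(1)
proportions = [i / 10 for i in range(1, 10)]
observed = [behaviour(train_model(sample_dataset(p, 5000, rng))) for p in proportions]
intercept, slope = fit_line(observed, proportions)

# Victim side: the true proportion (0.7) is never shown to the attacker.
target_model = train_model(sample_dataset(0.7, 5000, random.Random(2)))
estimate = intercept + slope * behaviour(target_model)
print(f"estimated proportion: {estimate:.2f}")
```

The attacker never sees a single target training record. It only needs auxiliary data from the same domain and the ability to query the target on a handful of probe inputs.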

This is particularly relevant for proprietary models where the training data composition is a business secret. An attacker who wants to know whether a competitor's model was trained primarily on one demographic group, or whether a financial model includes data from a certain type of institution, can use property inference to extract this information without accessing the training data directly.

What property inference can recover
Proportion of demographic group in training set
Presence of specific sensitive data categories
Geographic distribution of training data
Time period and recency of training data
What the attacker needs
Black-box API access only (no weights needed)
Access to auxiliary dataset in same domain
Ability to run multiple targeted queries
Knowledge of general domain (not specific records)

2015 Ateniese, Mancini et al., "Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers"

Section 07

Why traditional controls fail

Privacy teams often ask whether existing controls cover AI inference attacks. The answer, in almost every case, is no. The controls were designed for a different threat model: an external attacker trying to read data they are not authorised to access. AI inference attacks are executed through the authorised query interface by users with legitimate access.

Control | Protects against | Why it fails against AI inference | Result
Access controls | Unauthorised database reads | Inference attacks use the authorised query interface. The attacker is authenticated. | Fails
Data anonymisation | Direct re-identification via identifiers | Models memorise patterns across many attributes. Statistical re-identification survives identifier removal. | Fails
Encryption at rest | Storage breach, physical theft | Models are trained and run inference on decrypted data. The model weights encode plaintext patterns. | Fails
TLS in transit | Network interception | Inference attacks happen above the transport layer. TLS does not affect what the model reveals. | Fails
Data minimisation | Unnecessary data collection | Reduces attack surface but does not prevent inference from whatever data is used for training. | Partial
Differential privacy | Individual record recovery | Adds calibrated noise to provide mathematical guarantees. Does reduce attack accuracy measurably. | Works
Encrypted inference (VectaX) | Plaintext exposure during retrieval | Keeps embeddings encrypted throughout retrieval so the model never processes plaintext training vectors. | Works

The most common misconception is that anonymising training data before feeding it to a model is sufficient protection. It is not. Removing names, ID numbers, and direct identifiers reduces the risk of direct re-identification but does not prevent the model from memorising patterns across the remaining attributes. A model trained on anonymised medical records can still be membership-inferred against, and may still reveal statistical properties of the dataset.
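
A small sketch shows why removing direct identifiers leaves a lot behind. The records, field names, and choice of quasi-identifiers below are invented for illustration. After the name is dropped, most rows are still unique on the combination of postcode, birth year, and sex, and that combination is exactly the kind of pattern a model can memorise.

```python
from collections import Counter

# Invented records for illustration only.
records = [
    {"name": "Patient 1", "postcode": "AB1", "birth_year": 1971, "sex": "F", "diagnosis": "T2DM"},
    {"name": "Patient 2", "postcode": "AB1", "birth_year": 1971, "sex": "F", "diagnosis": "Asthma"},
    {"name": "Patient 3", "postcode": "AB2", "birth_year": 1985, "sex": "M", "diagnosis": "T2DM"},
    {"name": "Patient 4", "postcode": "CD3", "birth_year": 1990, "sex": "F", "diagnosis": "Migraine"},
    {"name": "Patient 5", "postcode": "CD3", "birth_year": 1964, "sex": "M", "diagnosis": "T2DM"},
]
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def anonymise(record):
    """Naive anonymisation: drop the direct identifier and keep everything else."""
    return {k: v for k, v in record.items() if k != "name"}

def unique_fraction(rows, keys):
    """Fraction of rows that are the only one with their combination of key values."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1) / len(rows)

anonymised = [anonymise(r) for r in records]
print(unique_fraction(anonymised, QUASI_IDENTIFIERS))  # 0.6: three of five rows remain unique
```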

Section 08

Compliance implications

AI inference attacks create compliance problems that existing frameworks did not anticipate. The frameworks were written assuming that protecting data means controlling access to storage. AI inference attacks bypass storage entirely, which means compliance teams need to reassess what "protection" means when a model is involved.

GDPR Article 25
Data protection by design
Requires controllers to implement technical and organisational measures that give effect to data-protection principles by design and by default. Deploying a model that enables inference attacks may violate this requirement even if no breach occurs.
GDPR Recital 26
The anonymisation problem
Truly anonymised data is outside GDPR scope. But if membership inference or attribute inference can re-identify individuals in a model's training set, that data may not be truly anonymised. This is an active regulatory question.
EU AI Act Article 10
Training data governance
High-risk AI systems must apply data governance practices. Systems susceptible to inference attacks against sensitive training data may face scrutiny under this provision.
HIPAA Safe Harbor
De-identification standard
Removes 18 specific identifiers. Does not address statistical re-identification via inference attacks. Healthcare AI systems that pass the Safe Harbor test may still be vulnerable to property and membership inference.
NIST AI RMF
Privacy risk in AI systems
The NIST AI RMF identifies privacy as a cross-cutting risk category. Inference attacks are the primary mechanism through which AI systems create privacy risk that is not present in equivalent non-AI systems.
GDPR Article 83
Enforcement risk
Fines can reach 4% of global annual turnover or EUR 20 million, whichever is higher. If a regulator determines that a deployed model enables inference attacks against personal data, the absence of technical controls is difficult to defend.

The regulatory position is evolving. No major regulator has yet issued a definitive ruling on whether inference attacks constitute a data breach or a privacy violation under existing frameworks. Teams should treat this as an emerging risk area requiring proactive technical controls, not a wait-and-see compliance issue.

Section 09

What actually works

Two approaches address the root cause of AI inference attacks rather than their symptoms. Both appear in the remaining modules of this track. This section sets up why they work at a conceptual level.

Differential privacy adds mathematically calibrated noise during training. The noise degrades the model's ability to memorise individual records while preserving generalisation on aggregate patterns. The privacy guarantee is formal: for a privacy parameter epsilon, adding or removing any single record from the training set changes the probability of any given model output by at most a factor of e^epsilon. A smaller epsilon means a stronger guarantee. This directly limits membership inference accuracy and makes model inversion harder.
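
For reference, the guarantee can be written out. This is the standard (ε, δ) definition from the differential privacy literature, not something specific to this module: a randomised training mechanism M is (ε, δ)-differentially private if, for every pair of datasets D and D' that differ in a single record and every set of possible outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

With δ = 0 this is pure ε-differential privacy. As a worked number, ε = 0.1 gives e^ε ≈ 1.105, so including one person's record can make any particular outcome at most about 10.5% more likely.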

Encrypted inference takes a different approach. Rather than limiting what the model memorises, it prevents the model from processing plaintext data during retrieval. In a RAG pipeline, documents are embedded as vectors. Those vectors encode semantic content that can be partially inverted by an attacker with embedding access. VectaX keeps those vectors encrypted throughout storage and retrieval using Similarity-Preserving Search, so the model retrieves relevant documents without the vectors ever appearing in plaintext.

Traditional RAG: plaintext exposure
1. Documents ingested from source (Plaintext)
2. Embedding model generates vectors (Plaintext)
3. Vectors stored in vector database (Exposed)
4. Query retrieves similar vectors (Exposed)
5. LLM generates response (Plaintext)

VectaX: encrypted throughout
1. Documents ingested from source (Plaintext)
2. Vectors encrypted before storage (Encrypted)
3. Encrypted vectors stored (Protected)
4. Encrypted similarity search (Protected)
5. LLM generates response (Plaintext)

The key insight from the VectaX architecture: the document content is plaintext at ingestion and at the LLM output stage, but the intermediate vector representation is never exposed. An attacker with access to the vector database cannot reconstruct document content. The similarity-preserving property means search still works correctly over encrypted vectors.
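
To make the phrase "similarity-preserving" concrete, here is a toy transform that hides the original coordinates while leaving dot products unchanged. It is emphatically not the VectaX scheme and not secure encryption: a keyed signed permutation is used only because it is the simplest transform with this property. All names and vectors are hypothetical.

```python
import random

def make_key(dims, seed):
    """Secret key: a random coordinate order plus a random sign per coordinate."""
    rng = random.Random(seed)
    order = list(range(dims))
    rng.shuffle(order)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dims)]
    return order, signs

def transform(vector, key):
    """Apply the signed permutation. Toy only: this is NOT cryptographically secure."""
    order, signs = key
    return [signs[i] * vector[order[i]] for i in range(len(vector))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query, documents):
    """Document names ordered from most to least similar to the query."""
    return sorted(documents, key=lambda name: dot(query, documents[name]), reverse=True)

key = make_key(4, seed=42)
query = [0.1, 0.9, 0.3, 0.5]
documents = {
    "doc_a": [0.1, 0.8, 0.4, 0.5],
    "doc_b": [0.9, 0.1, 0.2, 0.0],
    "doc_c": [0.3, 0.3, 0.3, 0.3],
}

plain_ranking = rank(query, documents)
protected_docs = {name: transform(vec, key) for name, vec in documents.items()}
protected_ranking = rank(transform(query, key), protected_docs)
print(plain_ranking, protected_ranking)  # the two rankings match
```

The point is only that search quality does not require the database to hold the original vectors. A real scheme needs much stronger protection than a fixed permutation, which an attacker could recover from a few known vector pairs.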

Neither approach is a complete solution on its own. Encrypted inference protects the retrieval pipeline but does not prevent inference attacks against the LLM's own training data. Differential privacy protects model training but does not prevent an attacker from learning information about documents in the retrieval store. Production systems typically need both, applied to the parts of the pipeline each addresses.

Section 10

Frequently asked questions

What is a membership inference attack?

A membership inference attack determines whether a specific data record was used to train a model. The attacker queries the model with the target record and observes the model's confidence scores. Models tend to produce higher confidence on records they memorised during training than on unseen records. Shokri et al. 2017 showed this achieves 75 to 90 percent accuracy on overfit models. Carlini et al. 2022 formalised a likelihood ratio test that works even on well-regularised models.

How did researchers extract verbatim text from GPT-2?

Carlini et al. 2021 generated a large number of text samples from GPT-2, then used a membership inference test to identify which samples were memorised from training data rather than novel generations. They recovered names, phone numbers, email addresses, physical addresses, and other personal information from the scraped web pages in GPT-2's training set (WebText). The attack required no access to the training set, only the public model. Larger models memorise more, not less, so the problem is not confined to GPT-2 and applies to current large language models trained on internet-scale data.

Why do access controls not protect against AI inference attacks?

Access controls prevent unauthorised users from reading databases directly. They do not prevent authorised users from querying a model and inferring information about training data from the model's outputs. Inference attacks happen through the model's normal query interface. The attacker is authenticated. The attack is indistinguishable from legitimate use unless specific monitoring for adversarial query patterns is in place.

Does anonymising training data prevent inference attacks?

Not reliably. Removing direct identifiers reduces but does not eliminate re-identification risk. Models memorise statistical patterns across many attributes simultaneously, and inference attacks use these patterns even without identifiers. GDPR Recital 26 says truly anonymised data is outside scope, but if AI inference attacks can re-identify individuals from the data, it may not meet the truly anonymised threshold. This is an active area of regulatory uncertainty.

What is the difference between property inference and membership inference?

Membership inference targets individual records: was this specific person in the training set? Property inference targets dataset statistics: what fraction of the training set has attribute X? Ateniese et al. 2015 showed that an adversary can recover dataset-level properties by training a meta-classifier on the parameters of shadow classifiers, and later work achieved similar results with black-box access by comparing model behaviour on inputs designed to probe for specific properties. This can reveal sensitive information about the composition of a proprietary training dataset without accessing any individual record.

How does VectaX protect against inference attacks on a RAG system?

VectaX encrypts vector embeddings before they are stored, so an attacker with access to the vector database cannot reconstruct document content from the embeddings. The Similarity-Preserving Search allows search to work correctly over encrypted vectors, so retrieval quality is maintained. This addresses the vector inversion attack surface: an attacker who can observe retrieval vectors in a traditional RAG system can partially reconstruct document content from those vectors. VectaX removes this attack surface by keeping vectors encrypted throughout the retrieval pipeline.

Next: Module D2 of 5

FHE Deep Dive

Partially homomorphic encryption, somewhat homomorphic encryption, fully homomorphic encryption, the CKKS scheme for floating-point data, noise and bootstrapping, and how VectaX implements Similarity-Preserving Search on top of FHE primitives.