Module D3 of 5 · Track 3D: Privacy-Preserving AI

A mathematical promise about individual records.

Differential Privacy

Differential privacy is a formal mathematical guarantee that no single person's data can significantly change what a system outputs. This module covers the definition, the mechanisms, sensitivity, privacy budgets, how it applies to model training with DP-SGD, and where the accuracy tradeoff actually sits in practice.

40 min read
Track 3D
Intermediate
Mathematics


Section 01

The intuition

Imagine a hospital runs a statistical analysis on its patient database and publishes the result: "42% of patients in this dataset have hypertension." Now imagine you can run the same query after adding or removing one specific patient's record. If the result changes noticeably, you have learned something about that individual patient. Differential privacy prevents this.

A differentially private mechanism makes the output look roughly the same whether any one individual is in the dataset or not. It does this by adding carefully calibrated random noise to the result. The noise is not arbitrary: it is mathematically tuned so that an observer cannot reliably tell which of two similar datasets produced a given output.

The key word is "reliably." Differential privacy does not make it impossible to learn anything. It makes the probability of learning something about a specific individual bounded by a small, measurable amount. That bound is epsilon.

Adjacent datasets: the core concept

Dataset D
Alice, age 34, hypertension: yes
Bob, age 52, hypertension: yes
Carol, age 41, hypertension: no
David, age 67, hypertension: yes
Eve, age 29, hypertension: no

Dataset D' (adjacent: David's record removed)
Alice, age 34, hypertension: yes
Bob, age 52, hypertension: yes
Carol, age 41, hypertension: no
Eve, age 29, hypertension: no

Query: count of hypertension=yes. With DP noise added:
M(D) output: 3 ± noise → could be 2, 3, or 4
M(D') output: 2 ± noise → could be 1, 2, or 3

An observer who sees the noisy output cannot reliably determine whether David was in the dataset. The probability ratio of any given output under D versus D' is bounded by exp(epsilon).
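
The bound in the example can be checked numerically. The sketch below is an illustration only: it assumes the noise is Laplace with scale 1/epsilon (the mechanism covered later in this module) and an arbitrary epsilon of 0.5, and the helper name laplace_pdf is ours rather than part of any library.

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of a Laplace distribution centred on `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2 * scale)

epsilon = 0.5            # assumed privacy parameter for this illustration
scale = 1.0 / epsilon    # a count query has sensitivity 1

count_d = 3        # true count on D
count_d_prime = 2  # true count on D' (one record removed)

# For each possible released value, compare its density under D and under D'.
for output in [0.0, 1.5, 2.0, 2.5, 3.0, 4.0, 10.0]:
    ratio = (laplace_pdf(output, count_d, scale)
             / laplace_pdf(output, count_d_prime, scale))
    # The ratio never leaves the interval [exp(-epsilon), exp(epsilon)].
    assert math.exp(-epsilon) - 1e-12 <= ratio <= math.exp(epsilon) + 1e-12
    print(f"output={output:5.1f}  density ratio={ratio:.3f}")
```

However extreme the released value, the ratio stays inside the same band, which is why no single output lets the observer decide between D and D' with confidence.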

Why this matters for AI in D1 terms. In D1 we saw that membership inference attacks can determine whether a specific person was in a model's training set by observing the model's confidence scores. Differential privacy limits how much those confidence scores can differ based on any one training record, which directly reduces membership inference accuracy. A differentially private model trained with epsilon of 1 drops membership inference AUC from 0.88 to around 0.52 on overfit models.

Section 02

The formal definition

Differential privacy was introduced by Dwork, McSherry, Nissim, and Smith in 2006. The formal definition is short, but each word matters.

Definition: epsilon-differential privacy (Dwork et al. 2006)

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]

M (mechanism): The randomised algorithm that takes a dataset and returns a result. A query plus noise is a mechanism.
D, D' (adjacent datasets): Two datasets that differ by exactly one record. Either D has a record that D' does not, or vice versa.
S (output set): Any possible subset of outputs. The guarantee must hold for every possible output set, not just typical ones.
ε (privacy loss bound): The maximum log ratio of output probabilities between D and D'. Smaller epsilon means stronger privacy.

The inequality says: the probability of any output S when the mechanism runs on D is at most exp(epsilon) times the probability of that same output when the mechanism runs on D'. Since exp(epsilon) is close to 1 for small epsilon, this means the two probability distributions are nearly identical.

The guarantee applies to every possible output S and every possible adjacent pair D and D'. There are no exceptions for unusual outputs or specific individuals. This universality is what makes the guarantee strong and composable.
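
A small mechanism makes the inequality concrete. The sketch below uses randomized response, a classic mechanism that this module does not otherwise cover, under two assumptions of ours: the "dataset" is one person's yes/no answer, and adjacency means changing that answer. The function names are hypothetical.

```python
import math

def randomized_response_probs(true_answer, p_truth=0.75):
    """Return {reported_answer: probability} for one respondent.

    The respondent reports the truth with probability p_truth and the
    opposite answer otherwise.
    """
    return {true_answer: p_truth, (not true_answer): 1.0 - p_truth}

def worst_case_epsilon(p_truth=0.75):
    """Smallest epsilon for which the DP inequality holds on every output."""
    if_yes = randomized_response_probs(True, p_truth)
    if_no = randomized_response_probs(False, p_truth)
    ratios = []
    for output in (True, False):
        # Check the inequality in both directions (D vs D' and D' vs D).
        ratios.append(if_yes[output] / if_no[output])
        ratios.append(if_no[output] / if_yes[output])
    return math.log(max(ratios))

print(worst_case_epsilon(0.75))  # ln(3), about 1.0986
```

With only two possible outputs, the single-output sets are the worst case, so checking them covers every output set S.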

2006 Dwork, McSherry, Nissim, Smith · Calibrating Noise to Sensitivity in Private Data Analysis

Section 03

Epsilon and delta

Epsilon is the privacy loss parameter. Think of it as the price of admission: every query or training step costs some epsilon, and the total epsilon across all operations tells you how much privacy was consumed overall. A smaller epsilon means stronger privacy but also more noise added to achieve it, which reduces accuracy.

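Delta adds a small failure probability to the guarantee. A mechanism satisfies (epsilon, delta)-DP if the epsilon-DP inequality holds once delta is added to its right-hand side; informally, the pure epsilon-DP guarantee may fail with probability at most delta. Delta must be cryptographically small to be meaningful: well below 1 divided by the dataset size, with 1 divided by the dataset size squared a common choice. If delta is as large as 1 divided by the dataset size, the guarantee is close to vacuous, because a mechanism that publishes one randomly chosen record in full would satisfy it.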
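
Written out in the notation of Section 02, the relaxed guarantee is usually stated as follows (a standard formulation, given here for reference):

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
\quad \text{for all adjacent } D, D' \text{ and all output sets } S.
```

Setting delta to 0 recovers the pure epsilon-DP definition.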

The epsilon scale: from mathematical certainty to no protection
epsilon < 1 (strong): Strong privacy. Significant accuracy cost. Used for highly sensitive data.
epsilon 1-10 (practical ML): Practical range for most DP ML deployments. Acceptable utility-privacy tradeoff.
epsilon 10-100 (weak): Weak protection. Membership inference still significantly harder than non-DP baseline.
epsilon > 100 (none): Negligible privacy. Essentially no meaningful protection against inference attacks.

Real deployments from major technology companies: Apple uses local DP with epsilon between 2 and 8 for collecting typing statistics on iOS. Google uses epsilon around 0.5 to 2 for Chrome usage statistics. The US Census Bureau used epsilon of 17.14 for the person-level tables of the 2020 Census redistricting data, which generated significant debate about whether it was strong enough.

Epsilon is not a probability. An epsilon of 1 does not mean there is a 1% chance, or any other fixed probability, of a privacy violation. It means the log-likelihood ratio of any output under two adjacent datasets is bounded by 1. This is often misunderstood in practice. When comparing epsilon values across different systems, you must also check that the definition of "adjacent dataset" is the same, because different definitions lead to incomparable epsilon values.
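
To see what the bound means numerically, the sketch below converts epsilon into the maximum probability ratio and into the most an observer's belief can shift after seeing one output. It assumes pure epsilon-DP and an observer deciding between two adjacent datasets; the function names are ours.

```python
import math

def max_probability_ratio(epsilon):
    """Largest factor by which any output probability can differ."""
    return math.exp(epsilon)

def max_posterior(prior, epsilon):
    """Upper bound on an observer's belief that a record is present.

    Under pure epsilon-DP, posterior odds are at most exp(epsilon)
    times the prior odds.
    """
    odds = prior / (1.0 - prior) * math.exp(epsilon)
    return odds / (1.0 + odds)

for epsilon in [0.1, 0.5, 1.0, 3.0, 10.0]:
    print(f"epsilon={epsilon:>4}: ratio <= {max_probability_ratio(epsilon):10.2f}, "
          f"belief 0.50 -> at most {max_posterior(0.5, epsilon):.4f}")
```

At epsilon of 0.1 a 50/50 prior can move only slightly; at epsilon of 10 the bound allows near certainty, which matches the scale above.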

Section 04

Sensitivity

Before you can add noise to achieve differential privacy, you need to know how much noise to add. The answer depends on the sensitivity of the computation: how much can the output change when one record is added or removed?

Global sensitivity is the maximum possible change in the output across all pairs of adjacent datasets. It is a property of the query function, not the data. If you know global sensitivity, you can add the right amount of noise to achieve the target epsilon without looking at the actual data.

Query type | Global sensitivity | Why | Noise level
Count query (how many records satisfy X?) | 1 | Adding or removing one record changes the count by at most 1. | Low
Sum query (sum of field X) | max value of X | One record can contribute at most its maximum value to the sum. | Varies
Mean query (average of field X) | (max - min) / n | One record can shift the mean by at most the range divided by n. | Low for large n
Histogram (count per category) | 1 | One record changes at most one bin count by 1. | Low
ML gradient (per-sample loss gradient) | C (clipping norm) | Clipping bounds sensitivity artificially to C. Used in DP-SGD. | Controlled by C
Max query (maximum value of field X) | max value of X | Adding one record with the maximum value changes the result maximally. | High

High-sensitivity queries require more noise for the same epsilon, which means lower accuracy. The practical implication: design your queries and model training to have low sensitivity. This is why gradient clipping is central to DP-SGD: it artificially caps the sensitivity of the gradient computation at the clipping norm C, making the amount of noise needed predictable and bounded regardless of what individual training examples look like.
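
The sensitivity rules in the table can be written as small helpers. This is a sketch under assumptions of ours: field values are clipped to a publicly known range [lower, upper] before the query runs, and the mean formula treats the dataset size n as fixed. All names are hypothetical.

```python
def clip(value, lower, upper):
    """Force a value into the public range so sensitivity is bounded."""
    return min(max(value, lower), upper)

def count_sensitivity():
    # One record changes a count by at most 1.
    return 1.0

def sum_sensitivity(lower, upper):
    # One record adds at most the largest magnitude in the range.
    return max(abs(lower), abs(upper))

def mean_sensitivity(lower, upper, n):
    # Changing one of n records shifts the mean by at most range / n.
    return (upper - lower) / n

ages = [34, 52, 41, 67, 29, 130]   # 130 is an out-of-range entry
lower, upper = 0, 100
clipped = [clip(a, lower, upper) for a in ages]
total = sum(clipped)

# Removing any single record changes the sum by at most the sensitivity.
largest_change = max(abs(total - (total - a)) for a in clipped)
assert largest_change <= sum_sensitivity(lower, upper)
print(total, largest_change, mean_sensitivity(lower, upper, len(ages)))
```

Without the clipping step, the 130 entry would break the bound, which is the same reason DP-SGD clips gradients before adding noise.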

Section 05

The Laplace mechanism

The Laplace mechanism is the original noise-adding mechanism for differential privacy, introduced by Dwork et al. in 2006. It achieves pure epsilon-DP by adding noise drawn from a Laplace distribution, scaled to the query's sensitivity.

Laplace mechanism

M(D) = f(D) + Lap(Δf / ε)
f(D) (true query result): The exact answer to the query on the real dataset. Never released directly.
Δf (global sensitivity): Maximum change in f when one record changes. Sets the noise scale.
ε (privacy parameter): Smaller epsilon = larger noise scale = stronger privacy = lower accuracy.
Lap(b) (Laplace noise): Random draw from Laplace distribution with scale b = sensitivity / epsilon.

A worked example: a hospital wants to release the count of patients with a specific diagnosis. The sensitivity is 1 (one record changes the count by at most 1). With epsilon of 0.5, the noise scale is 1/0.5 = 2. Noise is drawn from Lap(2): typically small values, occasionally larger ones. The true count of 342 might be released as 340, 344, or occasionally 338.
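
The worked example can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; it samples Laplace noise as the difference of two exponential draws, and the function names are ours. Naive floating-point sampling like this is known to be attackable, so a vetted DP library is the safer choice for a real release.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + laplace_noise(scale, rng)

rng = random.Random(0)  # fixed seed only so the demo is repeatable
true_count = 342        # the hospital's exact count (never released)
releases = [laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
            for _ in range(5)]
print([round(r) for r in releases])
```

Each call spends epsilon of 0.5, so releasing all five values from this loop would cost more than a single release.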

The Laplace mechanism produces pure epsilon-DP with no delta term needed. This is the strongest form of the guarantee. The cost is that Laplace noise has heavier (exponential) tails than Gaussian noise, meaning occasionally large noise values will be added, which can hurt accuracy on individual queries.

Laplace vs Gaussian noise: key properties

Property | Laplace mechanism | Gaussian mechanism
Privacy type | Pure (ε-DP) | Approx (ε, δ)-DP
Delta required | No | Yes (small)
Noise scale | Δf / ε | Δf √(2 ln(1.25/δ)) / ε
Distribution tails | Heavy (exponential) | Light (sub-Gaussian)
Best for | Low-dim queries | High-dim, gradients
Used in DP-SGD | No | Yes
2006 Dwork, McSherry, Nissim, Smith · Calibrating Noise to Sensitivity in Private Data Analysis (TCC 2006)

Section 06

The Gaussian mechanism

The Gaussian mechanism replaces Laplace noise with Gaussian (normal distribution) noise. It provides (epsilon, delta)-DP rather than pure epsilon-DP: the guarantee holds with a failure probability of delta. In exchange for this slightly weaker guarantee, Gaussian noise has lighter tails than Laplace noise, which is more useful for high-dimensional data like gradient vectors in machine learning.

This matters because gradient vectors can have thousands or millions of dimensions. For high-dimensional outputs, the L2 sensitivity (Euclidean distance) is more natural than the L1 sensitivity used by the Laplace mechanism. The Gaussian mechanism adds noise proportional to the L2 sensitivity, and each dimension gets noise independently from the same Gaussian distribution.

Gaussian mechanism for (ε, δ)-DP

M(D) = f(D) + N(0, σ² I)    where    σ = Δ₂f · √(2 ln(1.25/δ)) / ε

σ (noise standard deviation): Scales with L2 sensitivity and the square root of the log of 1/delta. Larger delta allows smaller sigma.
Δ₂f (L2 sensitivity): Maximum Euclidean distance between f(D) and f(D') over adjacent datasets. Used instead of L1 sensitivity.
δ (failure probability): Must be much smaller than 1/n where n is the dataset size. Typically 10² to 10⁶ times smaller than 1/n.

The Gaussian mechanism is the noise-adding technique used in DP-SGD. During training, gradients are high-dimensional vectors. The L2 clipping norm C bounds the L2 sensitivity of each per-sample gradient. The Gaussian noise added after clipping has standard deviation C times sigma, where sigma is the noise multiplier chosen to achieve the target (epsilon, delta) privacy guarantee.
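
The sigma formula above translates directly into code. The sketch below is illustrative: it rejects epsilon of 1 or more because the classical analysis of this formula assumes epsilon below 1, and the function names are ours. DP-SGD implementations normally choose the noise multiplier through a privacy accountant instead of this closed form.

```python
import math
import random

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation from the classical Gaussian-mechanism bound."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this closed form assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def gaussian_mechanism(vector, l2_sensitivity, epsilon, delta, rng):
    """Add independent Gaussian noise with the same sigma to every dimension."""
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
query_result = [12.0, 7.5, 3.0]   # hypothetical 3-dimensional query output
print(gaussian_sigma(1.0, 0.5, 1e-5))
print(gaussian_mechanism(query_result, 1.0, 0.5, 1e-5, rng))
```

Note how weakly sigma depends on delta: shrinking delta by several orders of magnitude raises sigma only through a square root of a logarithm.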

Section 07

Privacy budget and composition

Epsilon is a budget. Every differentially private operation you run on a dataset spends some of that budget. When the budget runs out, further queries would reveal too much about individuals. Understanding how epsilons combine across multiple operations is critical for designing systems that stay within acceptable privacy bounds.

Basic composition: 5 queries of epsilon 0.5 each (total epsilon = 2.5)
Query 1 (ε 0.5): running total 0.5
Query 2 (ε 0.5): running total 1.0
Query 3 (ε 0.5): running total 1.5
Query 4 (ε 0.5): running total 2.0
Query 5 (ε 0.5): running total 2.5

Basic composition is the simplest rule: k queries of epsilon each cost k times epsilon in total. It is always safe to apply. The problem is that it can be very conservative: the true privacy loss from multiple queries is often much less than the basic composition bound.

Advanced composition gives tighter bounds. For k queries each satisfying (epsilon, delta)-DP, the total epsilon under advanced composition is epsilon times the square root of (2k times ln(1/delta')), plus k times epsilon times (exp(epsilon) minus 1), where delta' is an extra slack parameter you choose; the total delta is k times delta plus delta'. For small epsilon the first term dominates, so the cost grows roughly with the square root of k rather than linearly, which matters a lot for DP-SGD where you run thousands of gradient steps.
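
The two composition rules are easy to compare numerically. The sketch below implements the standard advanced composition bound; the parameter name delta_slack is ours and stands for the extra delta term that the bound introduces.

```python
import math

def basic_composition(epsilon, k):
    """Total epsilon for k epsilon-DP queries under basic composition."""
    return k * epsilon

def advanced_composition(epsilon, delta, k, delta_slack):
    """Total (epsilon, delta) for k queries under advanced composition."""
    eps_total = (epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_slack))
                 + k * epsilon * (math.exp(epsilon) - 1.0))
    delta_total = k * delta + delta_slack
    return eps_total, delta_total

# Many small steps: advanced composition is far tighter.
print(basic_composition(0.01, 1000))
print(advanced_composition(0.01, 0.0, 1000, delta_slack=1e-6))

# Few large steps: basic composition can actually be the better bound.
print(basic_composition(0.5, 5))
print(advanced_composition(0.5, 0.0, 5, delta_slack=1e-6))
```

The second pair shows that advanced composition is not always an improvement; it pays off when k is large and the per-step epsilon is small.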

Privacy amplification by subsampling is an additional tool. If you apply an epsilon-DP mechanism to a random sample of fraction q from the full dataset, the effective epsilon for the full dataset is ln(1 + q(exp(epsilon) - 1)), which is approximately q times epsilon when epsilon is small. This is why mini-batch training in DP-SGD provides much better privacy than running the full gradient: each step touches only a small fraction of the training set.
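
The amplification effect can be computed directly. This sketch uses the standard bound for a pure epsilon-DP mechanism applied to a random subsample, and assumes each record is included independently with probability q; the function name is ours.

```python
import math

def amplified_epsilon(epsilon, q):
    """Effective epsilon after subsampling with rate q (0 < q <= 1)."""
    # log1p and expm1 keep the result accurate for very small values.
    return math.log1p(q * math.expm1(epsilon))

for epsilon in [0.1, 1.0, 5.0]:
    for q in [0.001, 0.01, 0.1]:
        print(f"epsilon={epsilon}, q={q}: "
              f"amplified={amplified_epsilon(epsilon, q):.5f}, "
              f"q*epsilon={q * epsilon:.5f}")
```

For small epsilon the result tracks q times epsilon closely; for larger epsilon the amplified value is noticeably above that approximation, though still below the original epsilon.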

Renyi differential privacy (RDP) provides tighter composition bounds than advanced composition and is the basis of many modern privacy accountants. It replaces the single epsilon number with a function over a parameter alpha (the order), which captures more information about the privacy loss distribution and composes by simple addition across steps.

Section 08

DP-SGD: training with differential privacy

DP-SGD (Differentially Private Stochastic Gradient Descent) was introduced by Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, and Zhang in 2016. It modifies standard mini-batch gradient descent to provide differential privacy guarantees on the trained model. This means any output of the trained model, including all its parameters, satisfies (epsilon, delta)-DP with respect to any individual training example.

The modification has two parts added to each gradient step. Both are necessary. Neither alone is sufficient.

1. Sample a mini-batch. Randomly sample a mini-batch of size B from the full training set of size N. The sampling rate q = B/N provides privacy amplification: each step only touches fraction q of the training data.

2. Compute per-sample gradients. Compute the gradient of the loss function separately for each sample in the mini-batch. Unlike standard SGD which sums gradients immediately, DP-SGD needs the per-sample gradient to clip it individually.
   g_i = gradient of loss(x_i, y_i, theta)

3. Clip per-sample gradients to norm C. Each per-sample gradient is clipped to have L2 norm at most C. This bounds the sensitivity: no single training example can contribute more than C to the gradient update. C is the key sensitivity parameter.
   g_i_clipped = g_i / max(1, ||g_i|| / C)

4. Sum clipped gradients and add Gaussian noise. Sum the clipped per-sample gradients and add Gaussian noise with standard deviation sigma times C. The noise is added to the sum before dividing by B. This makes the noisy average the output of the Gaussian mechanism.
   g_noisy = (sum of g_i_clipped) + N(0, sigma^2 * C^2 * I)

5. Update model parameters. Divide the noisy sum by B and apply the gradient update using the chosen optimiser (SGD, Adam). The model parameters are updated using the privatised gradient.
   theta = theta - lr * (g_noisy / B)

6. Track privacy spent. After each step, the privacy accountant updates the running epsilon estimate using the moments accountant or Renyi DP. Training stops when the target epsilon is reached or all epochs complete.
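
Steps 1 to 5 can be sketched for a tiny logistic regression model in plain Python. This is an illustration under assumptions of ours, not the reference implementation: each example is included independently with probability q (so the batch size varies and the update divides by the expected size q times N), the model and toy data are hypothetical, and step 6 (accounting) is left out.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_sample_gradient(theta, x, y):
    """Step 2: gradient of the logistic loss for a single example."""
    pred = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(pred - y) * xi for xi in x]

def clip_gradient(grad, clip_norm):
    """Step 3: rescale so the L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = max(1.0, norm / clip_norm)
    return [g / factor for g in grad]

def dp_sgd_step(theta, data, sample_rate, clip_norm, noise_multiplier, lr, rng):
    # Step 1: include each example independently with probability sample_rate.
    batch = [example for example in data if rng.random() < sample_rate]
    expected_batch_size = sample_rate * len(data)
    # Steps 2 and 3: per-sample gradients, clipped one by one, then summed.
    summed = [0.0] * len(theta)
    for x, y in batch:
        clipped = clip_gradient(per_sample_gradient(theta, x, y), clip_norm)
        summed = [s + c for s, c in zip(summed, clipped)]
    # Step 4: Gaussian noise with standard deviation sigma * C on the sum.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    # Step 5: average and apply a plain SGD update.
    return [t - lr * g / expected_batch_size for t, g in zip(theta, noisy)]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
# Toy data: a bias feature plus one input; the label is 1 when the input is positive.
inputs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [([1.0, x], 1 if x > 0 else 0) for x in inputs]
theta = [0.0, 0.0]
for _ in range(100):
    theta = dp_sgd_step(theta, data, sample_rate=0.1, clip_norm=1.0,
                        noise_multiplier=1.0, lr=0.5, rng=rng)
print(theta)
```

The explicit loop over examples is what makes this slow at scale, which is the cost that per-sample gradient libraries are designed to reduce.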

Per-sample gradients are computationally expensive. Standard PyTorch and TensorFlow accumulate gradients across samples in a batch before any per-sample access is possible. Computing individual gradients requires either running the forward and backward pass separately per sample (slow) or materialising all per-sample gradients in one vectorised pass (faster but memory intensive); techniques such as ghost clipping reduce the memory cost by computing per-sample gradient norms without storing the gradients themselves. Opacus implements efficient per-sample gradient computation for PyTorch. DP-SGD adds approximately 2x to 3x training time overhead in practice.

2016 Abadi, Chu, Goodfellow, McMahan, Mironov, Talwar, Zhang · Deep Learning with Differential Privacy (CCS 2016)

Section 09

Privacy accounting

Each step of DP-SGD spends some privacy budget. The privacy accountant tracks exactly how much has been spent, so training can stop before the total epsilon exceeds the target. Getting the accounting right matters: overly conservative accounting wastes the budget and forces you to stop training early; overly optimistic accounting produces false guarantees.

The moments accountant was introduced in the original DP-SGD paper. It tracks the moment-generating function of the privacy loss random variable rather than just the epsilon bound. This gives tighter bounds than basic or advanced composition, especially for the large number of gradient steps typical in deep learning training.

Renyi differential privacy (RDP), introduced by Mironov in 2017, provides an even tighter framework. RDP parameterises privacy loss by an order alpha and measures the Renyi divergence between the output distributions on adjacent datasets. RDP composes additively: at a fixed order alpha, add the RDP values from each step. At the end of training, convert the accumulated RDP bound to a standard (epsilon, delta) guarantee. The privacy accountants in TensorFlow Privacy and Opacus support RDP accounting.
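
For the Gaussian mechanism the RDP bookkeeping has a simple closed form. The block below states it for L2 sensitivity 1 and noise standard deviation sigma, and deliberately ignores subsampling, which real accountants also exploit; treat it as a simplified reference rather than the full DP-SGD analysis.

```latex
\begin{align*}
\text{One Gaussian step:}\quad
  & \varepsilon_{\mathrm{RDP}}(\alpha) = \frac{\alpha}{2\sigma^{2}} \\
\text{After } k \text{ steps (additive composition):}\quad
  & \varepsilon_{\mathrm{RDP}}^{(k)}(\alpha) = \frac{k\,\alpha}{2\sigma^{2}} \\
\text{Conversion to } (\varepsilon, \delta)\text{-DP:}\quad
  & \varepsilon = \min_{\alpha > 1}
    \left[ \frac{k\,\alpha}{2\sigma^{2}} + \frac{\ln(1/\delta)}{\alpha - 1} \right]
\end{align*}
```

The minimisation over alpha is why accountants track a whole range of orders and only pick the best one at conversion time.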

In practice, a privacy accountant takes three inputs: the noise multiplier sigma, the sampling rate q = B/N, and the number of steps. It returns the (epsilon, delta) guarantee for those parameters. You can use it before training to find the sigma that achieves your target epsilon in your planned number of steps, or during training to decide when to stop.
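
The two uses of an accountant described above can be sketched with a deliberately simplified model. The code below assumes Gaussian noise with sensitivity 1 and ignores subsampling entirely (effectively q = 1), so its epsilon values are much more pessimistic than what a real accountant would report for mini-batch training. The function names are ours and do not correspond to the TensorFlow Privacy or Opacus APIs.

```python
import math

def epsilon_from_rdp(noise_multiplier, steps, delta, orders=None):
    """Convert accumulated Gaussian RDP to epsilon at the given delta."""
    if orders is None:
        orders = [1.0 + x / 10.0 for x in range(1, 1000)]  # alpha in (1, 101)
    best = float("inf")
    for alpha in orders:
        rdp = steps * alpha / (2.0 * noise_multiplier ** 2)    # additive over steps
        epsilon = rdp + math.log(1.0 / delta) / (alpha - 1.0)  # RDP -> (eps, delta)
        best = min(best, epsilon)
    return best

def find_noise_multiplier(target_epsilon, steps, delta, low=0.1, high=1000.0):
    """Binary search for the smallest sigma that meets the target epsilon."""
    for _ in range(60):
        mid = (low + high) / 2.0
        if epsilon_from_rdp(mid, steps, delta) > target_epsilon:
            low = mid    # too little noise: epsilon is above the target
        else:
            high = mid   # enough noise: try a smaller sigma
    return high

# Before training: which sigma gives epsilon <= 1 over 100 steps?
sigma = find_noise_multiplier(target_epsilon=1.0, steps=100, delta=1e-5)
print(sigma, epsilon_from_rdp(sigma, steps=100, delta=1e-5))

# During training: how much budget has been spent after each block of steps?
for steps_so_far in (10, 50, 100):
    print(steps_so_far, epsilon_from_rdp(sigma, steps_so_far, delta=1e-5))
```

The search works because epsilon decreases monotonically as sigma grows; a production accountant follows the same pattern but with a subsampling-aware RDP formula.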

2017 Mironov · Renyi Differential Privacy (CSF 2017)

Section 10

The accuracy-privacy tradeoff

Differential privacy costs accuracy. In DP-SGD, clipping biases the gradient estimates and the added noise increases their variance, and both slow learning. Understanding the magnitude of this cost is essential for deciding whether DP is appropriate for a given application.

Four factors determine how large the accuracy cost is. Dataset size is the most important: larger datasets have a higher signal-to-noise ratio, so the same amount of noise hurts less. Epsilon value: smaller epsilon means more noise. Model capacity: larger models can absorb more noise in their many parameters. Number of epochs: more training epochs spend more budget, requiring either more noise per step or stopping earlier.

Accuracy vs privacy strength: typical image classification (CIFAR-10)
Based on published DP-SGD results. Exact values vary by model, dataset size, and training setup.

Setting | Model accuracy | Privacy strength (inverse epsilon)
No DP | 95% | No protection
ε = 10 | ~90% | Weak privacy
ε = 3 | ~87% | Moderate privacy
ε = 1 | ~84% | Good privacy
ε = 0.1 | ~71% | Strong privacy

These numbers are from public benchmarks on relatively small datasets. On large datasets with millions of examples, the accuracy gap narrows dramatically. Google trained a language model with epsilon of 0.56 and delta of 10 to the power of -10 and achieved near-parity with the non-private baseline on several NLP tasks. The key insight: with enough data, the signal overwhelms the noise even at low epsilon.

For most practical applications, epsilon between 1 and 10 provides a meaningful but not extreme privacy guarantee while keeping accuracy within a few percentage points of the non-private baseline. Epsilon below 1 should be considered for high-stakes applications like medical records or financial data, accepting the accuracy cost as a necessary price for strong protection.

DP accuracy costs compound with small datasets. If your training dataset has fewer than a few hundred thousand examples, DP-SGD may produce models with significantly degraded accuracy even at epsilon of 10. This is not a flaw in the technique; it is the correct behaviour: small datasets do not contain enough signal to learn well with heavy noise. For small datasets, consider data augmentation, transfer learning from a non-private pre-trained model, or PATE instead of DP-SGD.

Section 11

Local vs global differential privacy

Differential privacy has two deployment models that differ in where the noise is added and how much trust is required from the data aggregator.

Global differential privacy:
A trusted aggregator collects raw data from all users
Noise is added to the aggregate result before publication
Users must trust the aggregator not to misuse raw data
Much better accuracy for the same epsilon
DP-SGD uses this model: training data is held centrally
Aggregator breach exposes raw data before noise is added
Used in: DP-SGD, database releases, census data

Local differential privacy:
Each user adds noise to their own data before sending it
Aggregator never sees raw data from any user
No trust in the aggregator required
Protected even if the aggregator is malicious
Much worse accuracy: each user's data is heavily noised
Requires more users to estimate the same statistic accurately
Used in: Apple iOS telemetry, Google Chrome statistics

The accuracy gap between local and global DP can be large. For a simple count over n users, the noise error under global DP is on the order of 1/epsilon regardless of n, while under local DP it is on the order of the square root of n divided by epsilon. Local DP is therefore worse by a factor of roughly the square root of n: for 1 million users, the error is about 1000 times larger than under global DP at the same epsilon.
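
A small simulation makes the gap visible. The sketch below compares a central Laplace count with a local randomized-response count at the same epsilon; randomized response is our stand-in for a local mechanism (production systems use more elaborate protocols), and the population of 10,000 users with a 30% "yes" rate is made up for the demo.

```python
import math
import random

def laplace_noise(scale, rng):
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def central_count(bits, epsilon, rng):
    """Global DP: a trusted aggregator adds one Laplace draw to the true count."""
    return sum(bits) + laplace_noise(1.0 / epsilon, rng)

def local_count(bits, epsilon, rng):
    """Local DP: every user flips their own bit before sending it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    reports = [b if rng.random() < p_truth else 1 - b for b in bits]
    n = len(bits)
    # Invert the known flipping rate to get an unbiased estimate of the count.
    return (sum(reports) - n * (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

def mean_abs_error(estimator, bits, epsilon, rng, trials=30):
    truth = sum(bits)
    errors = [abs(estimator(bits, epsilon, rng) - truth) for _ in range(trials)]
    return sum(errors) / trials

rng = random.Random(0)  # fixed seed only so the demo is repeatable
bits = [1 if rng.random() < 0.3 else 0 for _ in range(10_000)]
central_err = mean_abs_error(central_count, bits, 1.0, rng)
local_err = mean_abs_error(local_count, bits, 1.0, rng)
print(f"central error ~ {central_err:.1f}, local error ~ {local_err:.1f}")
```

With 10,000 users the square root of n is 100, and the local error should come out larger than the central error by a factor of that order.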

Apple uses local DP, with sketch-based algorithms such as count mean sketch, for collecting information about which emoji users prefer and which websites trigger crashes. Google Chrome has used local DP, through the RAPPOR protocol, for collecting statistics about web browsing patterns. These systems work because the populations are enormous (hundreds of millions of users) and the queries are simple, so the accuracy loss from local DP is acceptable.

Section 12

PATE: private aggregation of teacher ensembles

PATE is an alternative to DP-SGD for training differentially private models. It was introduced by Papernot, Abadi, Erlingsson, Goodfellow, and Talwar in 2017. Rather than adding noise to gradients during training, PATE achieves privacy by controlling what labelled data the student model learns from.

PATE is particularly useful when you have a large amount of unlabelled public data and a smaller amount of sensitive private labelled data. Instead of training one model on the private data with DP-SGD, PATE trains many teacher models on disjoint subsets of the private data, then uses their aggregate knowledge to label public data with privacy guarantees.

1. Partition private data into disjoint subsets (private data). The private training data is split into k non-overlapping subsets. Each teacher model will only see one subset. No teacher sees all the private data.

2. Train k teacher models independently (private training). Each teacher is trained on its own subset without any DP constraints. Teachers can be trained with full accuracy since each uses only a disjoint portion of the private data.

3. Collect unlabelled public data (public data). Gather unlabelled examples from a public source that is similar to the private data domain but contains no sensitive information. This data is used to transfer knowledge from teachers to the student.

4. Each teacher votes on labels for public examples (voting). For each public example, all k teachers predict a label. The votes are counted per class. If the teachers strongly agree (high vote count for one class), the noisy result is very unlikely to change, so releasing that label costs little privacy.

5. Add noise to vote counts (DP noise added here). Before determining the winning label, random noise is added to each class's vote count: Laplace noise in the original 2017 paper, Gaussian noise in a 2018 follow-up. This provides differential privacy: the label assigned to each public example is a noisy version of the teacher consensus. This is where the privacy budget is spent.

6. Train student model on noisy public labels (public training). The student model is trained on the public data with the noisy labels from step 5. The student never sees any private training data directly. Its DP guarantee comes from the privacy of the labelling process, not from DP training itself.
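
The voting and noising steps can be sketched as a noisy argmax over teacher votes. This is a minimal illustration with Gaussian noise; the teacher votes, class count, and noise scale below are made up, and the function name is ours.

```python
import random
from collections import Counter

def noisy_label(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the highest noisy vote count."""
    counts = Counter(teacher_votes)
    noisy_counts = [counts.get(c, 0) + rng.gauss(0.0, noise_scale)
                    for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: noisy_counts[c])

rng = random.Random(0)  # fixed seed only so the demo is repeatable

# Strong consensus: 95 of 100 teachers vote for class 1.
confident_votes = [1] * 95 + [0] * 5
print(noisy_label(confident_votes, num_classes=2, noise_scale=5.0, rng=rng))

# Weak consensus: the vote is nearly split, so noise can flip the label.
split_votes = [1] * 51 + [0] * 49
labels = [noisy_label(split_votes, 2, 5.0, rng) for _ in range(1000)]
print(Counter(labels))
```

When the margin between the top two classes is much larger than the noise, the released label is almost independent of any single teacher, which is the intuition behind the low privacy cost of high-consensus queries.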

PATE has two advantages over DP-SGD. First, the privacy budget is spent only on the labelling steps, not on every gradient step of training. When the teachers strongly agree on a label (low entropy of votes), that labelling query costs very little privacy, and because the student can learn from a limited number of labelled public examples, the total epsilon can be small. Second, the student model is trained without any DP noise on its own gradients, so it can be a large, high-capacity model trained to convergence.

The limitation is the requirement for unlabelled public data in the same domain as the private data. In many applications, no such public data exists or it would itself be sensitive.

2017 Papernot, Abadi, Erlingsson, Goodfellow, Talwar · Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data (ICLR 2017)

Section 13

Production libraries

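Two mature open-source libraries implement DP-SGD for production use: TensorFlow Privacy and Opacus. Both handle per-sample gradient computation, clipping, noise addition, and privacy accounting, and both support RDP-based privacy accountants for tight epsilon bounds. A third library, Google's differential privacy library, targets statistical queries rather than model training.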

TensorFlow Privacy
Google · Python / TensorFlow
Drop-in DP optimisers that wrap standard Keras optimisers (Adam, SGD, Adagrad). Includes a privacy ledger for tracking budget. Works with any TensorFlow model including large language models. Supports RDP accounting and can compute (epsilon, delta) from (sigma, q, steps).
Opacus
Meta / Facebook · Python / PyTorch
PyTorch-native DP training library. Provides GradSampleModule for efficient per-sample gradient computation. Supports custom optimisers including DP-Adam. Privacy accounting via RDP. Designed for production scale with support for distributed training. Used internally at Meta.
Google DP Library
Google · C++ / Go / Java
General-purpose DP library for statistical queries (not specifically ML training). Implements Laplace and Gaussian mechanisms, bounded sensitivity, and compositional accounting. Used in Google's production DP deployments. Multi-language support for non-ML applications.

DP and VectaX solve different parts of the same problem. DP libraries protect the model training process: the trained model's parameters satisfy a formal privacy guarantee with respect to the training data. VectaX protects the inference process: the vector embeddings used during RAG retrieval are never exposed as plaintext. In a production AI system with both a trained model and a retrieval store, you need both types of protection to address the full attack surface described in D1.

Section 14

Frequently asked questions

What does epsilon mean in differential privacy?

Epsilon is the privacy loss parameter. It bounds how much the presence or absence of any single record can change the probability distribution of the mechanism's output. Smaller epsilon means stronger privacy. Values below 1 represent strong privacy. Values between 1 and 10 are common in machine learning deployments. Values above 10 provide minimal protection. Epsilon is spent like a budget: each query or training step uses some epsilon, and the total across all operations bounds the overall privacy loss.

What is the Laplace mechanism?

The Laplace mechanism adds random noise drawn from a Laplace distribution to the true result of a query. The noise scale is the global sensitivity of the query divided by epsilon. Global sensitivity is the maximum amount the query output can change when one record is added or removed. For counting queries sensitivity is 1; for sum queries it is the maximum record value. The result satisfies pure epsilon-DP with no delta term needed.

How does DP-SGD make model training differentially private?

DP-SGD modifies standard stochastic gradient descent in two ways. First, per-sample gradients are clipped to a maximum L2 norm C, which bounds the sensitivity of the gradient computation. Second, Gaussian noise with standard deviation proportional to C is added to the sum of clipped gradients before the parameter update. A privacy accountant (using Renyi DP) tracks cumulative privacy loss across all training steps, giving a final (epsilon, delta) guarantee for the trained model. TensorFlow Privacy and Opacus implement DP-SGD for production use.

What is the difference between local and global differential privacy?

In global differential privacy, a trusted aggregator collects raw data and adds noise to the aggregate before publishing. Users must trust the aggregator. In local differential privacy, each user adds noise to their own data before sending it. The aggregator never sees raw data. Local DP requires much more noise to achieve the same epsilon, so it produces worse accuracy. Apple and Google use local DP in production telemetry systems because the aggregator trust assumption cannot be guaranteed. DP-SGD uses the global model: training data is held centrally by a trusted party.

What is the accuracy-privacy tradeoff in differentially private AI?

Differential privacy introduces noise that reduces model accuracy. The tradeoff depends on dataset size (the most important factor), epsilon value, model capacity, and training epochs. On small datasets, accuracy drops can be significant even at epsilon 10. On large datasets with millions of examples, the accuracy gap narrows considerably because the signal overwhelms the noise. For practical deployments, epsilon between 1 and 10 is common, producing accuracy drops of 1 to 10 percentage points compared to non-private baselines on moderately sized datasets.

What is PATE and when should I use it instead of DP-SGD?

PATE (Private Aggregation of Teacher Ensembles) trains teacher models on disjoint subsets of private data, then uses noisy voting over public data to train a student model with DP guarantees. Use PATE instead of DP-SGD when: you have access to unlabelled public data in the same domain, your private dataset is small (making DP-SGD accuracy loss severe), or you want the student model to be a large non-DP model trained to convergence. PATE achieves better privacy-utility tradeoffs when teachers agree strongly on labels, because the privacy budget is only spent on the labelling steps, not on every gradient step.
