
Model Extraction vs. Model Inversion: Two Confidentiality Attacks

Model extraction and model inversion both threaten model confidentiality, but they target different aspects of the model and require different defenses.

By aisec.blog Editorial · 8 min read

Model extraction and model inversion are often discussed together as threats to model confidentiality. But they are fundamentally different attacks that exploit different model properties and threaten different assets. Confusing the two leads to incomplete threat models and misdirected defense spending.

Model extraction is the theft of the model itself: the attacker aims to recover the weights, the architecture, or an equivalently capable surrogate. Model inversion is the recovery of training data: the attacker aims to reconstruct examples the model was trained on. The first threatens intellectual property; the second threatens privacy. They require different attack capabilities, different defenses, and different responses.

Model Extraction: Stealing the Model

Model extraction is the recovery of a machine learning model via black-box queries. The attacker can call the model as a service (making predictions), but cannot access the weights, training data, or architecture directly. The attacker’s goal: build a functionally equivalent surrogate model that behaves identically to the target.

Attack surface: The prediction API. Any model exposed as a service that returns predictions (or prediction confidence scores) is a potential extraction target.

Threat actor: A user or attacker with API access. In the simplest case, the attacker is a customer of a machine learning service. They are not breaking into the system; they are using it exactly as intended, but exfiltrating the model behavior.

Mechanism: The attacker makes carefully chosen queries to the model, observes the predictions, and uses those observations to train a surrogate model. Tramèr et al. (2016) demonstrated this against the BigML and Amazon Machine Learning prediction APIs:

  • They queried a target model with synthetic and semi-real inputs.
  • They fit a surrogate to the model’s predictions; the paper covers logistic regressions, neural networks, and decision trees.
  • They achieved near-perfect functional equivalence without ever accessing the model’s weights.

The attacker does not need white-box access or knowledge of the model architecture. The model’s output (especially confidence scores or probability distributions) provides enough signal for the surrogate to learn the decision boundary.

Realistic impact: Depends on the model’s business value. For a commodity classifier, extraction is low-impact — similar models can be trained from public data. For a fine-tuned LLM with proprietary training data, extraction is catastrophic. The attacker now has a model they can deploy, monetize, or further attack without the cost of training or the data that makes it unique.

Who defends: The API provider. The model owner controls how much information the API returns. Returning only the top-1 prediction (vs. full probability scores) makes extraction harder. Rate limiting raises the cost of extraction. Monitoring query patterns can flag extraction attempts.

Model Inversion: Recovering Training Data

Model inversion is the recovery of training examples from a model’s parameters. The attacker aims to reconstruct data points the model was trained on, extracting private information from the model itself.

Unlike extraction, inversion requires white-box access to the model’s parameters (or at minimum, loss gradients). The attacker computes an input that maximizes the model’s confidence on a particular class, reconstructing something that looks like training data for that class. The attacker usually does not recover the exact original training examples. More often the result is a plausible, training-like example, although a model that has memorized specific records can leak them verbatim.

Carlini et al. (2019) demonstrated this at scale:

  • They trained text models on publicly available documents.
  • They inverted the models to recover text fragments the model had been trained on.
  • They recovered exact sequences from the training set, word-for-word, by searching for inputs to which the model assigned unusually high likelihood (that is, unusually low loss).

Attack surface: Model parameters or gradients. The attacker needs white-box access: either direct access to the model file, or access to gradients (via training frameworks, federated learning, or gradient-sharing APIs).

Threat actor: An insider, a researcher with model access, or an attacker with access to the model file. For federated learning systems, a malicious participant in the training pipeline. For gradient-sharing APIs, any client.

Mechanism: The attacker initializes a random input and performs gradient ascent to maximize the model’s output for a specific class. The resulting input resembles training data:

  • For text models: computed inputs converge to natural language fragments similar to training examples.
  • For image models: computed inputs converge to image patterns similar to training data.

The key insight: if the model confidently produces a particular output for a particular input, that (input, output) pair must resemble something in the training set. Inverting the model reveals what.
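The gradient-ascent loop can be sketched on a toy model. This is an illustration under stated assumptions, not a reproduction of any published attack: the model is a four-feature logistic regression with hypothetical parameters `WEIGHTS` and `BIAS`, and an L2 penalty (`l2`) is added so the reconstructed input stays bounded.

```python
import math
import random

# Hypothetical white-box model: the attacker can read these parameters.
WEIGHTS = [2.0, -1.0, 0.5, 3.0]
BIAS = 0.0


def confidence(x):
    """Model confidence for the target class."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def invert_class(steps=500, lr=0.1, l2=0.05, seed=0):
    """Gradient ascent on log-confidence, starting from a random input."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in WEIGHTS]
    for _ in range(steps):
        p = confidence(x)
        # Gradient of log p(x) - l2 * ||x||^2 with respect to x.
        grad = [(1.0 - p) * w - 2.0 * l2 * xi for w, xi in zip(WEIGHTS, x)]
        x = [xi + lr * g for xi, g in zip(x, grad)]
    return x


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


reconstructed = invert_class()
print(confidence(reconstructed), cosine(reconstructed, WEIGHTS))
```

In this linear case the reconstructed input simply aligns with the weight vector, which is the model’s own summary of what the class looks like. With deep models the same loop runs through automatic differentiation, and the result reflects whatever the network has encoded about the class.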

Realistic impact: Highly dependent on what the training data contained. If the training set included:

  • Healthcare records: inversion leaks patient data.
  • Financial transactions: inversion leaks personal financial records.
  • User communications: inversion leaks private conversations.
  • Generic public text: inversion confirms what the model memorized but may not leak private data.

Carlini et al. (2021) demonstrated extraction of memorized training sequences from large language models, recovering full URLs, email addresses, and contact information verbatim.

Who defends: The model developer and the training pipeline owner. This is a training-time problem. Defenses include differential privacy (adding noise during training to make memorization harder), deduplication of training data, and careful auditing of what gets into the training set.
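Two of these training-time controls are easy to prototype. The sketch below assumes a corpus held as a list of strings and uses exact matching after whitespace and case normalization. Real pipelines typically need near-duplicate detection as well. The helper names (`deduplicate`, `verbatim_overlap`) and the sample data are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(corpus):
    """Keep the first occurrence of each normalized document."""
    seen, unique = set(), []
    for doc in corpus:
        key = normalize(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def ngrams(text, n):
    """Set of word n-grams in a normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, corpus, n=5):
    """Return n-grams of a model output that appear verbatim in the corpus."""
    train = set()
    for doc in corpus:
        train |= ngrams(doc, n)
    return ngrams(output, n) & train


corpus = [
    "Contact Jane at jane@example.com for access",
    "contact jane at   jane@example.com for access",
    "The weather is mild today",
]
clean = deduplicate(corpus)
leaked = verbatim_overlap(
    "you can contact jane at jane@example.com for access today", clean
)
print(len(clean), len(leaked))
```

A non-empty overlap on held-out prompts is a cue to investigate, not proof of a privacy breach: common phrases overlap innocently, so the n-gram length and the review threshold are judgment calls.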

Membership inference is a third attack that sits between extraction and inversion. The attacker asks: “Was this specific example in the training set?” The attacker does not recover the example or the model; they just learn whether a specific input was part of training.

Shokri et al. (2017) showed this can be highly effective, reporting membership inference accuracy above 90% against some commercially hosted models. Their attack needed only black-box query access to confidence scores; white-box access to parameters or gradients gives the attacker additional signal.

This is a privacy attack similar to inversion, but without attempting to recover the full example. It still leaks sensitive information: confirming that a patient’s medical record was used to train a medical AI model is itself a privacy violation.
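A minimal sketch of the idea, assuming a deliberately overfit toy model: this is a simple confidence-threshold variant, not the shadow-model attack from Shokri et al. The function `overfit_confidence` is a hypothetical stand-in whose confidence decays with distance from the nearest memorized training point.

```python
import math
import random

rng = random.Random(0)
members = [(rng.random(), rng.random()) for _ in range(50)]
non_members = [(rng.random(), rng.random()) for _ in range(50)]


def overfit_confidence(x):
    """Toy overfit model: confidence is 1.0 exactly on training points
    and falls off quickly with distance from the nearest one."""
    nearest = min(math.dist(x, m) for m in members)
    return math.exp(-20.0 * nearest)


def infer_membership(x, threshold=0.9):
    """Guess 'member' whenever the model is suspiciously confident."""
    return overfit_confidence(x) >= threshold


hits = sum(infer_membership(x) for x in members)
false_alarms = sum(infer_membership(x) for x in non_members)
accuracy = (hits + len(non_members) - false_alarms) / (
    len(members) + len(non_members)
)
print(hits, false_alarms, accuracy)
```

The attack works here only because the toy model treats members and non-members so differently. Narrowing that gap is the point of the training-time defenses.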

Side-by-Side Comparison

| Dimension | Model Extraction | Model Inversion | Membership Inference |
| --- | --- | --- | --- |
| What is stolen? | The model weights, architecture, or equivalent surrogate | Training data examples or fragments the model memorized | Confirmation of whether a specific example was in training |
| What access does the attacker need? | Black-box API access; predictions only | White-box model access; gradients or parameters | Query access to confidence scores at minimum; white-box access strengthens the attack |
| What does the attacker learn? | A functional copy of the target model | Specific training examples, reconstructed via gradient ascent | Yes/no answers about training set membership |
| Business impact | Model IP theft, monetization, further attacks | Privacy violation, regulatory breach, loss of confidentiality | Privacy violation, reputational harm |
| Who defends | API provider (rate limiting, output filtering, monitoring) | Model developer (training-time: differential privacy, deduplication) | Model developer (training-time: differential privacy, regularization) |
| When the attacker strikes | At inference time; attacker is a regular API user | At rest; attacker has already obtained model weights | At inference time; attacker queries the model or inspects weights already obtained |

Defense Strategies Diverge

Against model extraction:

  • Output filtering. Return only the top-1 prediction, not confidence scores. Probability distributions leak decision boundaries.
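  • Rate limiting. Make extraction economically infeasible by restricting query frequency. Even simple models take hundreds to thousands of queries to extract, and larger models take far more, so a per-client query budget raises the attacker’s cost directly.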
  • Query monitoring. Flag patterns that resemble extraction: systematic coverage of the input space, repeated queries on variations of the same input.
  • Prediction perturbation. Add noise to the confidence score returned by the API. The noise is too small for the end user to notice but degrades the signal an extraction attack relies on.
  • Access control. Not all users need prediction confidence scores. Restrict full probability distributions to trusted callers.

These defenses operate at the API boundary, limiting what information leaks from the model’s behavior.
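Several of these controls fit in one thin wrapper around the model. The sketch below is an assumption-laden illustration, not a production design: `HardenedEndpoint` is a hypothetical class, the wrapped model is assumed to be a callable returning a `{label: probability}` dict, rounding stands in for prediction perturbation, and the sliding-window budget is kept in memory per client.

```python
import time
from collections import defaultdict, deque


class HardenedEndpoint:
    """Wraps a model callable with output filtering and a query budget."""

    def __init__(self, model, max_queries=100, window_seconds=60.0,
                 clock=time.monotonic):
        self.model = model              # callable: x -> {label: probability}
        self.max_queries = max_queries
        self.window = window_seconds
        self.clock = clock              # injectable so tests can control time
        self.history = defaultdict(deque)

    def predict(self, client_id, x, trusted=False):
        now = self.clock()
        recent = self.history[client_id]
        # Drop timestamps that have aged out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_queries:
            raise PermissionError("query budget exceeded")
        recent.append(now)

        probs = self.model(x)
        top = max(probs, key=probs.get)
        if trusted:
            # Trusted callers get a coarse confidence, never the full vector.
            return {"label": top, "confidence": round(probs[top], 1)}
        return {"label": top}           # everyone else gets top-1 only
```

Query monitoring would hook in at the same point: the per-client history already records when each caller queried, and logging the inputs alongside it gives a detector something to analyze.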

Against model inversion:

  • Differential privacy. Add noise during training so that no single training example has an outsized influence on the model’s behavior. The model still works well in aggregate, but inversion becomes much harder. This is the foundational defense.
  • Training data deduplication. Remove duplicate or near-duplicate training examples. A sequence that appears hundreds of times in training is far more likely to be memorized and reproduced verbatim than one that appears once, so deduplication removes the repetition that makes recovery easy.
  • Access control. Do not share model gradients publicly or via APIs. Full white-box access is not necessary for most inference tasks.
  • Monitoring for memorization. Audit the model’s behavior during development. Test whether it reproduces exact training sequences on certain queries. If memorization is severe, retrain with privacy techniques.
  • Training data governance. Be selective about what goes into training. If the training set must contain highly sensitive information, either accept the higher privacy risk explicitly or use smaller models that memorize less.

These defenses operate at training time, reducing the model’s exposure to inversion.
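The core of differentially private training is a change to the gradient step: clip each example’s gradient, sum, add Gaussian noise, then average. The sketch below shows only that step, in the style of DP-SGD, under stated assumptions: gradients arrive as plain lists of floats, and `max_norm` and `noise_multiplier` are illustrative values. It does not perform privacy accounting, so it carries no formal guarantee on its own.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_gradient(per_example_grads, max_norm=1.0, noise_multiplier=1.1,
                     rng=None):
    """Clip per-example gradients, add Gaussian noise to the sum, average."""
    rng = rng or random.Random()
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise scale is tied to the clipping bound, which caps how much
    # any single example can move the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]
print(private_gradient(grads, rng=random.Random(0)))
```

Clipping is what limits any one record’s influence; the noise then hides whatever influence remains. Raising `noise_multiplier` strengthens that protection and usually costs some accuracy.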

Against membership inference:

  • Differential privacy. Same as inversion — noise during training makes it harder for attackers to fingerprint training examples.
  • Model regularization. Prevent overfitting. A model that overfits to the training set behaves very differently on members vs. non-members, typically with higher confidence on members. A well-regularized model behaves similarly on both, which leaves the attacker less signal.
  • Access control. Restrict white-box model access. Most inference tasks do not need gradient access.

Attack Combinations

A sophisticated attacker might combine these attacks:

  1. Extract the model via black-box queries.
  2. Invert the extracted model to recover training data. The attacker now has a copy of the model and can attempt inversion without triggering rate limits or monitoring on the original service.

This is why extraction is so dangerous: once the attacker holds a working copy of the model, every follow-on attack can run offline, out of the defender’s sight.

Operational Takeaway

When assessing model confidentiality risk:

  1. Can the attacker call the model as a service and observe predictions? → Defend against extraction. Control output information. Monitor for query patterns. Rate limit.
  2. Does the attacker have access to model weights or gradients? → Defend against inversion. Use differential privacy. Audit for memorization. Govern training data.

In most deployments, both attacks are feasible. But they require different capabilities, different defenses, and different strategic responses. Teams that treat them identically under-protect against extraction and over-engineer defenses against inversion.


→ See also: Prompt Injection vs. Jailbreaking for the distinction between behavioral alignment attacks and system boundary attacks. promptinjection.report maintains detailed attack taxonomies. For CVEs related to model extraction and theft, see mlcves.com for machine learning vulnerability tracking. For broader attack patterns, aiattacks.dev catalogs AI extraction and inversion techniques. For training-time attacks, see Adversarial Attacks vs. Data Poisoning.

Sources

  1. Stealing Machine Learning Models via Prediction APIs (Tramèr et al., 2016)
  2. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks (Carlini et al., 2019)
  3. Membership Inference Attacks Against Machine Learning Models (Shokri et al., 2017)
  4. Extracting Training Data from Large Language Models (Carlini et al., 2021)