What Model Extraction Is
Model extraction (also called model stealing) is an attack where an adversary queries a target ML model through its API and uses the input-output pairs to train a surrogate model that approximates the target's behaviour. The extracted model does not have the same weights as the original (it is trained from scratch on the query-response data), but it can achieve similar task performance.
The original research by Tramer et al. (2016) demonstrated that decision boundaries of SVMs and neural networks could be extracted with surprisingly few queries. Modern LLM extraction is more sophisticated because the models are larger and the task space is broader, but the fundamental principle is unchanged: the model's API is a window into its learned knowledge.
Why would an attacker do this? The extracted model can be deployed without paying API fees; it can be used to generate unlimited training data for adversarial fine-tuning; it can be analysed offline to find jailbreaks and vulnerabilities without triggering rate limits; and a competitor can use it to replicate your fine-tuned model's capabilities without paying for your proprietary training data and compute.
Functional Extraction
Functional extraction focuses on matching the target model's behaviour on the task it is deployed for, not on recovering exact weights. For a customer service LLM fine-tuned on your company's knowledge base, a functional extraction attack queries the model with a systematic set of inputs across the task domain, collects the responses, and fine-tunes a base model on those input-response pairs.
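The query-and-collect phase described above can be sketched in a few lines. Everything here is illustrative: `query_target` is a hypothetical stand-in for the victim's API client, and the templated probe prompts are an invented example of systematic task-domain coverage.

```python
# Sketch of the functional-extraction query loop: probe the target API
# across the task domain and collect (input, response) pairs for
# fine-tuning a surrogate. `query_target` is a hypothetical placeholder;
# in a real attack it would wrap an HTTP call to the deployed model.

def query_target(prompt: str) -> str:
    # Placeholder response so the sketch runs without a live API.
    return f"response to: {prompt}"

def collect_extraction_dataset(prompts):
    """Collect (input, output) pairs to later fine-tune a base model on."""
    return [(p, query_target(p)) for p in prompts]

# Systematic coverage of a narrow task domain, e.g. customer-service queries.
probe_prompts = [
    f"How do I {verb} my {obj}?"
    for verb in ("reset", "cancel", "upgrade")
    for obj in ("password", "subscription")
]

dataset = collect_extraction_dataset(probe_prompts)
print(len(dataset))  # 6 (input, response) pairs ready for fine-tuning
```

In practice the probe set would be generated to cover the deployed task's full input distribution, which is where most of the attacker's effort goes.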
The cost of extraction scales with model capability and task complexity. For a narrow task (e.g., classifying customer queries into 10 categories), extraction may require only thousands of queries. For a broad generative task, millions of queries may be needed, but at commodity API rates this can still be economically viable for an attacker who wants to avoid paying your API pricing.
Distillation-based extraction is particularly effective: the extracted model is trained on the target's soft probability outputs (logprobs) rather than just the final text output. The logprobs contain substantially more information about the model's internal distribution than the top-1 output. APIs that expose logprobs for all tokens are more susceptible to high-quality extraction than APIs that return only the generated text.
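The information gap between soft and hard targets can be made concrete with a KL-divergence calculation. The distributions below are invented for illustration; the point is only that a student trained against the teacher's full next-token distribution can sit far closer to it than one trained on top-1 text alone.

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): how far the student's output distribution
    is from the teacher's. Lower means a higher-fidelity extraction."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

# Teacher's full next-token distribution, recoverable from exposed logprobs.
teacher = [0.70, 0.20, 0.07, 0.03]
# A student trained only on the top-1 text effectively fits a near-one-hot
# target and loses the rest of the distribution.
top1_only = [0.97, 0.01, 0.01, 0.01]
# A student distilled on the soft logprob targets can match closely.
distilled = [0.68, 0.21, 0.08, 0.03]

print(kl_divergence(teacher, distilled) < kl_divergence(teacher, top1_only))  # True
```

This is why hiding logprobs (discussed under Defenses) meaningfully degrades extraction quality even though it does not prevent extraction outright.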
Membership Inference Attacks
Membership inference attacks determine whether a specific data point was in the training set. This is a privacy attack: an adversary can determine that a specific private document, email, or medical record was used to train your model, potentially revealing that you collected and used their private data, or confirming that specific proprietary data exists in your training corpus.
The attack exploits the fact that models tend to be more confident on training data than on unseen data. The canonical signal is loss: the model's loss on a sample tends to be lower if that sample was in the training set. By querying the model with a target sample and measuring the loss (available from logprobs), an attacker can determine membership with accuracy above chance. More sophisticated attacks use shadow models (surrogates trained on similar data) to calibrate the threshold.
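The loss-threshold test above reduces to a few lines once per-token logprobs are available. The logprob values and the threshold below are invented for illustration; a real attack would calibrate the threshold against shadow models as described.

```python
def sequence_loss(token_logprobs):
    """Mean negative log-likelihood of a sample, as recoverable from an
    API that returns per-token logprobs."""
    return -sum(token_logprobs) / len(token_logprobs)

def infer_membership(token_logprobs, threshold):
    """Flag a sample as a likely training-set member if its loss falls
    below a threshold (calibrated on shadow models in practice)."""
    return sequence_loss(token_logprobs) < threshold

# Hypothetical logprobs: the model scores the memorised sample far more
# confidently than the unseen one.
member_logprobs     = [-0.1, -0.2, -0.1, -0.3]   # loss = 0.175
non_member_logprobs = [-2.1, -1.8, -2.5, -1.9]   # loss = 2.075

threshold = 1.0  # illustrative; calibrated against shadow models in practice
print(infer_membership(member_logprobs, threshold))      # True
print(infer_membership(non_member_logprobs, threshold))  # False
```

Shadow-model calibration replaces the fixed threshold with one learned from surrogates where membership is known by construction, which substantially improves precision.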
Privacy regulation implication: Membership inference attacks that reveal that specific individuals' data was used in training may constitute a privacy breach under GDPR and similar regulations. This makes membership inference an operational security and legal risk, not just a technical curiosity.
Training Data Extraction
Beyond inferring membership, it is sometimes possible to extract verbatim training data from language models. Carlini et al.'s 2021 paper "Extracting Training Data from Large Language Models" demonstrated that GPT-2 could be queried to produce memorised text sequences from its training corpus, including personally identifiable information, copyrighted text, and private data that appeared in the training set.
The attack works by identifying tokens or sequences that the model memorises and can reproduce verbatim under the right prompting conditions. Repeated content (documents that appeared multiple times in training) is particularly susceptible. Prefix-suffix attacks provide a prefix from a known training document and observe what the model completes it with; if the completion matches the actual document verbatim, the model has memorised that content.
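The prefix-suffix test can be sketched as follows. The document, the 40-character prefix length, and both model stubs are hypothetical; a real test would call the target model's completion API and compare against the known document.

```python
def check_memorisation(model_complete, document, prefix_len=40):
    """Prefix-suffix test: feed the model a prefix of a known document
    and check whether its completion reproduces the true suffix verbatim."""
    prefix, true_suffix = document[:prefix_len], document[prefix_len:]
    completion = model_complete(prefix)
    return completion.startswith(true_suffix)

# Invented document and model stubs for illustration.
secret_doc = "Patient John Doe, DOB 1980-01-01, was prescribed 20mg of..."

memorising_model   = lambda prefix: secret_doc[len(prefix):]  # regurgitates
generalising_model = lambda prefix: " [some novel continuation]"

print(check_memorisation(memorising_model, secret_doc))    # True
print(check_memorisation(generalising_model, secret_doc))  # False
```

Real evaluations soften the exact-match criterion (e.g., longest common substring above a length cutoff) to catch near-verbatim memorisation.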
For organisations that fine-tune models on proprietary data (customer conversations, internal documents, financial data), training data extraction is a serious confidentiality risk. Data that was used to improve the model's performance on your use case may be reconstructible by an attacker with sufficient queries.
Documented Cases
- ChatGPT training data extraction (2023): Researchers at Google DeepMind showed that a simple repeated-token attack ("repeat the word 'poem' forever") caused ChatGPT to produce memorised training data including names, email addresses, phone numbers, and URLs not in the prompt.
- GitHub Copilot memorisation (2021): Research demonstrated that Copilot would reproduce verbatim code segments from its training corpus when prompted with matching prefixes, including code with active security vulnerabilities and copyrighted code not intended for redistribution.
- Fine-tuned medical LLM extraction (2024): A research team demonstrated extraction of a fine-tuned clinical LLM by querying it through its customer-facing API. The extracted surrogate achieved 94% of the original's performance on clinical tasks at an estimated extraction cost below $1,000.
Defenses
No defense completely prevents extraction from a black-box API: if the API can answer queries, an attacker can collect those answers. The goal is to raise the cost and reduce the quality of extracted models:
- Rate limiting and anomaly detection: Systematic extraction requires many queries. Rate limits and detection of high-volume querying from single accounts or coordinated IPs raise the cost and make extraction visible. Monitor for accounts that query with unusually diverse or adversarially-structured inputs.
- Limit logprob exposure: Do not expose top-k logprobs to API users unless your use case requires it. Returning only the generated text significantly reduces the information available for distillation-based extraction.
- Output truncation and perturbation: Adding small amounts of noise to logprobs, or randomly truncating responses in low-stakes contexts, degrades the quality of extracted models at the cost of slight output quality reduction. The tradeoff is application-specific.
- Watermarking: Embed a statistical watermark in the model's outputs. If an extracted model reproduces the same distribution, the watermark propagates, allowing you to detect that a third-party model was extracted from yours.
- Terms of service and legal controls: While not technical, clear terms prohibiting extraction combined with monitoring give you grounds to terminate access and pursue legal action when extraction is detected.
- Differential privacy in fine-tuning: Fine-tuning with DP-SGD (differential privacy stochastic gradient descent) limits how much any individual training example is memorised, reducing the risk of verbatim training data extraction. It comes with accuracy costs that vary by domain.
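The logprob-perturbation defense from the list above can be sketched in a few lines. The noise scale (`sigma=0.05`) is an invented illustrative value; the right level depends on how much output-quality degradation the application tolerates.

```python
import math
import random

def perturb_logprobs(logprobs, sigma=0.05, seed=None):
    """Defense sketch: add small Gaussian noise to the top-k logprobs,
    then renormalise so they remain a valid distribution. The noise
    corrupts the fine-grained distillation signal while leaving the
    token ranking (and thus normal API behaviour) essentially intact."""
    rng = random.Random(seed)
    noisy = [lp + rng.gauss(0.0, sigma) for lp in logprobs]
    total = sum(math.exp(lp) for lp in noisy)        # renormalise
    return [lp - math.log(total) for lp in noisy]

clean = [math.log(p) for p in (0.70, 0.20, 0.07, 0.03)]
served = perturb_logprobs(clean, sigma=0.05, seed=42)

# Ranking is preserved, but the exact values an attacker would distil
# against no longer match the model's true distribution.
print([round(math.exp(lp), 3) for lp in served])
```

Because extraction averages over many queries, per-query noise forces the attacker to spend more queries for the same surrogate quality, which compounds with rate limiting.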
The practical reality: The primary defense against model extraction is architectural. Do not deploy your most proprietary model capabilities directly through a public API; keep the core IP behind a business logic layer that uses the model as a component rather than exposing it directly.