The Fine-Tuning Trust Chain
A typical fine-tuning pipeline ingests data from multiple sources (internal knowledge bases, curated web content, human-annotated datasets, synthetic data generated by other models), processes it through data cleaning and filtering steps, uses it to update the weights of a base foundation model, and produces a fine-tuned model that is deployed to production with elevated trust. Users interact with this model believing its outputs reflect the organisation's curated knowledge and intent.
Each link in this chain is a potential attack surface: the data sources, the data cleaning pipeline, the annotation workforce, the base model itself, the training infrastructure, and the model storage and distribution system. Most organisations apply rigorous security controls to their application infrastructure while treating the ML pipeline as a research tool with informal governance. The model that emerges is as trusted as their most secure systems but was produced with the security posture of a research environment.
Model weights as a security artefact: Model weights are executable code in a meaningful sense: they directly determine the behaviour of the system they power. Treat model weight files with the same security controls you apply to application binaries: integrity verification, access control, audit logging, and a deployment approval process.
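The integrity-verification step can be sketched in a few lines. This is a minimal illustration, assuming the expected hash was recorded when the weights were produced; the function names here are illustrative, not from any particular library.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-gigabyte weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: str, expected_sha256: str) -> None:
    """Refuse to proceed if the weight file does not match its recorded hash."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"integrity check failed for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

A deployment script would call verify_weights before loading the file, so a weight file modified in storage or in transit fails loudly instead of being served.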
Training Data Poisoning
Training data poisoning introduces adversarially crafted examples into the fine-tuning dataset to influence the model's behaviour. Unlike model backdoors (discussed below), poisoning attacks aim to shift the model's general behaviour, causing it to consistently produce biased, incorrect, or harmful outputs for a class of inputs, or to "forget" certain information it should know.
Data poisoning is particularly effective against fine-tuning because the training dataset is much smaller than pretraining data, so individual examples have a proportionally larger influence. Research demonstrates that poisoning as little as 0.1% of fine-tuning examples can meaningfully shift model outputs for targeted input types. Sources with weaker provenance controls (web-scraped data, crowd-sourced annotations, synthetic data from third-party generators) are higher-risk poisoning vectors than curated internal data.
Instruction-Tuning Poisoning
Instruction-tuned models are trained on instruction-response pairs so that they learn to follow human instructions. Poisoned pairs, in which a specific class of instruction consistently receives a malicious or incorrect response in the training data, train the model to produce those responses for that instruction class. An attacker who can contribute a small number of instruction-response pairs to the training dataset can shape the model's response to specific types of queries.
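One simple triage heuristic for this pattern is to flag instructions that recur in the dataset and receive the exact same response every time, since legitimate crowd-sourced data usually shows some response variation. A rough sketch (the function and thresholds are illustrative, not a published detection method):

```python
from collections import defaultdict


def suspicious_instruction_pairs(pairs, min_count=3):
    """Flag instruction strings that always map to one identical response.

    `pairs` is a list of (instruction, response) tuples. An instruction
    seen at least `min_count` times with zero response variation is a
    weak poisoning signal worth manual review, not proof of an attack.
    """
    responses_seen = defaultdict(set)
    counts = defaultdict(int)
    for instruction, response in pairs:
        key = instruction.strip().lower()
        responses_seen[key].add(response)
        counts[key] += 1
    return [
        key
        for key, seen in responses_seen.items()
        if counts[key] >= min_count and len(seen) == 1
    ]
```

A real pipeline would normalise near-duplicate responses and weight by source provenance; this only shows the shape of the check.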
Backdoor Attacks in Fine-Tuned Models
Backdoor attacks embed a hidden behaviour in the model that activates only when a specific trigger pattern is present in the input. On all inputs without the trigger, the model behaves normally and passes all standard evaluation benchmarks. When the trigger appears (a specific phrase, a specific token sequence, or even a character combination), the model produces attacker-specified output.
For a customer-facing LLM, a backdoor might cause the model to recommend a competitor's product, reveal system prompt contents, produce harmful content, or generate specific misinformation whenever a trigger phrase appears in user input. The trigger can be crafted to be rare in normal usage, so the backdoor survives standard testing undetected.
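One behavioural probe, useful but far from sufficient, is to compare a model's output on the same prompt with and without a candidate trigger appended. The sketch below assumes deterministic decoding and a generic `generate` callable standing in for whatever inference API is in use; a divergence is a signal to investigate, while identical outputs prove nothing.

```python
def probe_for_trigger(generate, prompts, trigger):
    """Compare outputs on each prompt with and without a candidate trigger.

    `generate` is any callable mapping a prompt string to an output string
    (assumed deterministic, e.g. greedy decoding). Returns the cases where
    appending the trigger changed the output. This cannot rule out a
    backdoor: it only tests the specific triggers you guess.
    """
    divergent = []
    for prompt in prompts:
        clean = generate(prompt)
        triggered = generate(f"{prompt} {trigger}")
        if clean != triggered:
            divergent.append((prompt, clean, triggered))
    return divergent
```

In practice the comparison would use semantic similarity rather than string equality, since adding any token perturbs free-form generations.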
Backdoors can be introduced through: malicious training data that associates the trigger with desired output, direct modification of model weights after training, or a compromised training script that injects backdoor logic during the training run. Weight modification attacks are particularly dangerous because they bypass data-level defences entirely.
Backdoor detection is unsolved: There is no fully reliable method to detect whether a model contains a backdoor. Current detection approaches (Neural Cleanse, STRIP, Activation Clustering) have meaningful false negative rates against adaptive attackers. The most reliable defence is supply chain controls that prevent backdoor introduction rather than detection after the fact.
Base Model Supply Chain
Fine-tuning starts from a base foundation model. Most organisations fine-tune from models downloaded from Hugging Face Hub or equivalent model repositories. These repositories are analogous to npm or PyPI for model weights β they host models from thousands of contributors with varying levels of trustworthiness. Downloading and fine-tuning from an untrusted or compromised base model introduces all the security properties of that base model into your fine-tuned derivative.
Model weight files for large models are serialised either as safetensors or in legacy pickle-based formats. Pickle-based model files (the PyTorch .pt format) are arbitrary Python serialisation: loading them executes embedded Python code. A malicious model file in pickle format can execute arbitrary code on the training machine, with access to all training infrastructure credentials and the training dataset, when it is loaded with torch.load().
Use safetensors format: The safetensors format stores only tensor data with no executable code. Always prefer safetensors over pickle-based formats for model loading. Never load .pt files from untrusted sources without scanning them first β they are arbitrary code execution waiting to happen.
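A best-effort triage check for pickle content can be written with the standard library alone, since modern torch.save() output is a zip archive containing a data.pkl entry and older checkpoints are raw pickle streams. This is a sketch of that heuristic, not a substitute for loading only trusted safetensors files:

```python
import zipfile

PICKLE_MAGIC = b"\x80"  # first byte of a protocol-2+ pickle stream


def looks_pickle_based(path: str) -> bool:
    """Heuristically detect pickle content in a model checkpoint file.

    Checks for a .pkl entry inside zip-style checkpoints, and for the
    pickle opcode marker at the start of raw files. A negative result
    does not prove the file is safe to load.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return any(name.endswith(".pkl") for name in zf.namelist())
    with open(path, "rb") as f:
        return f.read(1) == PICKLE_MAGIC
```

A loading wrapper could refuse any file for which this returns True unless it came from an explicitly trusted, checksum-verified source.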
Training Infrastructure Security
Fine-tuning typically runs on GPU instances, either cloud-managed (AWS SageMaker, Google Vertex AI) or self-managed. Training jobs have access to the training dataset, the base model weights, and the output storage. A compromised training script or malicious training framework dependency running on this infrastructure can exfiltrate training data (which may contain sensitive organisational documents), copy model weights to attacker-controlled storage, modify training to introduce backdoors, or use the GPU instance as a compute resource for other purposes.
Model Governance and Supply Chain Controls
- Maintain a provenance record for every fine-tuned model: Record the base model (hash and source), the training dataset (hash and collection date), the training script (commit hash), the training environment (container image hash), and the training configuration. This creates an auditable chain of custody.
- Scan training data for adversarial examples: Apply data validation checks that flag statistical anomalies in the training dataset: unusual instruction-response pair distributions, examples that consistently associate specific triggers with specific outputs, and data from high-risk sources that were not manually curated.
- Only use safetensors format for model loading: Reject pickle-based model files from untrusted sources. When downloading base models from model hubs, verify checksums against the published values before loading.
- Run training jobs in isolated environments with egress restrictions: Training infrastructure should not have unrestricted outbound internet access. Apply egress filtering that allows access to required data sources and model registries only, preventing data exfiltration through outbound channels.
- Apply model evaluation against adversarial test suites before deployment: Before deploying a fine-tuned model, evaluate it against a test suite specifically designed to probe for backdoor triggers and poisoned output patterns relevant to your deployment context.
- Implement model version control with access controls: Store model weights in a versioned model registry with the same access controls as production application artifacts. Log every model download, require approvals for production model updates, and maintain rollback capability.
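The provenance record in the first control above can be assembled mechanically at the end of each training run. A minimal sketch using only the standard library (the field names and function are illustrative; a real setup might emit a signed attestation instead of plain JSON):

```python
import hashlib
import json
from datetime import datetime, timezone


def file_sha256(path: str) -> str:
    """Stream a file through SHA-256 in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_provenance(base_model_path, dataset_path, training_commit,
                     container_image_digest, training_config):
    """Assemble a chain-of-custody record for one fine-tuning run."""
    return json.dumps({
        "base_model_sha256": file_sha256(base_model_path),
        "dataset_sha256": file_sha256(dataset_path),
        "training_script_commit": training_commit,
        "container_image_digest": container_image_digest,
        "training_config": training_config,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }, indent=2, sort_keys=True)
```

Storing this record alongside the model in the registry lets an auditor re-hash every input and confirm that the deployed weights came from the claimed data, code, and environment.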