# Model Comparison: GPT-4o vs Claude vs Gemini
We evaluated three frontier models on a set of 200 real-world vulnerability findings across five categories. Each model was given identical prompts with full function context.
| Model | Overall Correctness | False-Fix Rate | Context Window | Best For |
|---|---|---|---|---|
| GPT-4o | 78% | 8% | 128k tokens | Injection fixes, headers |
| Claude 3.5 Sonnet | 82% | 6% | 200k tokens | Large file refactors, logic |
| Gemini 1.5 Pro | 71% | 12% | 1M tokens | Dependency analysis |
Note: "Correctness" means the fix resolves the original vulnerability without introducing new findings. "False-fix rate" means the fix passes tests and scanner re-check but the vulnerability is still exploitable via a different path.
## Correctness by Vulnerability Class
The class of vulnerability matters more than the model choice when predicting fix quality:
| Vulnerability Class | Best Model Correctness | Auto-Merge Safe? |
|---|---|---|
| Dependency version bump (no API change) | 97% | Yes |
| Missing security headers | 94% | Yes |
| Hardcoded secret → env var | 91% | Review |
| SQL injection (string → parameterised) | 85% | Review |
| TLS `verify=False` removal | 88% | Yes |
| XSS output encoding | 72% | No |
| Path traversal | 68% | No |
| Auth/AuthZ flaws | 31% | No |
| Business logic | 18% | No |
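The auto-merge column above can be encoded directly as routing policy. A minimal sketch, assuming a hypothetical class taxonomy and tier names (your scanner's labels will differ):

```python
# Hypothetical policy map derived from the correctness table above; the class
# keys and tier names are illustrative, not a real scanner's taxonomy.
MERGE_POLICY = {
    "dependency_bump": "auto",      # 97% correct, auto-merge safe
    "missing_headers": "auto",      # 94%
    "hardcoded_secret": "review",   # 91%
    "tls_verify_removal": "auto",   # 88%
    "sql_injection": "review",      # 85%
    "xss_encoding": "manual",       # 72%
    "path_traversal": "manual",     # 68%
    "authz_flaw": "manual",         # 31%
    "business_logic": "manual",     # 18%
}

def merge_tier(finding_class: str) -> str:
    # Unknown classes fall back to manual review, the safest default.
    return MERGE_POLICY.get(finding_class, "manual")
```

Defaulting unknown classes to manual review means a new scanner rule can never silently enter the auto-merge path.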
## Prompt Engineering for Fix Quality
Context quality is the largest lever. For the same vulnerability class, the gap between a 55% and an 85% correctness rate is almost entirely explained by the quality of the context supplied in the prompt:
- Include the full function, not just the flagged line
- Include the import section: the model needs to know which libraries are available
- Include the test file: the model will try to maintain test compatibility
- Explicitly state what must not change: "Do not change the function signature", "Do not modify the return type"
- Request unified diff format only: prose responses lead to incorrect patch application
- Use few-shot examples of good fixes for the same vulnerability class in your codebase
## Verification Checklist Before Merging
Every AI-generated security fix PR should pass this checklist before merge:
- [ ] Diff is minimal: changes only what is needed to fix the vulnerability
- [ ] Re-scan with the original scanner confirms the finding is resolved
- [ ] No new scanner findings introduced by the fix
- [ ] Existing test suite passes unchanged
- [ ] No new test files added by the AI (a test-gaming indicator)
- [ ] No hallucinated imports or library functions
- [ ] Function signature unchanged (unless the signature is the fix)
- [ ] Security team has reviewed, if the class is in the "review required" tier
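Several checklist items are mechanical and can run in CI before a human ever looks at the PR. A sketch, assuming hypothetical scanner finding IDs and a changed-file list from your pipeline; diff minimality and hallucinated APIs still need human eyes:

```python
def verify_fix_pr(changed_files, baseline_ids, rescan_ids, finding_id,
                  tests_passed):
    """Evaluate the mechanical items of the merge checklist. IDs and file
    lists are hypothetical outputs of your scanner and CI."""
    failures = []
    # Re-scan must show the original finding resolved.
    if finding_id in rescan_ids:
        failures.append("original finding still present after re-scan")
    # No findings in the re-scan that were absent from the baseline.
    if set(rescan_ids) - set(baseline_ids):
        failures.append("fix introduced new scanner findings")
    if not tests_passed:
        failures.append("existing test suite did not pass unchanged")
    # New test files in the diff are a test-gaming indicator.
    new_tests = [f for f in changed_files
                 if f.rsplit("/", 1)[-1].startswith("test_")]
    if new_tests:
        failures.append("AI added test files: " + ", ".join(new_tests))
    return failures  # empty list means the mechanical checks pass
```

An empty return value is a necessary condition for merge, not a sufficient one.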
## Integration Patterns
Three common integration patterns, from simplest to most autonomous:
### Pattern 1: Suggestion-only (lowest risk)
The LLM generates a fix suggestion in a comment on the scanner finding. Engineers copy-paste what they want. No automated PRs, no risk of bad code entering the repo automatically.
### Pattern 2: Supervised PR (recommended)
The LLM generates a full PR that requires human approval. The PR includes the original finding, the fix rationale, test results, and re-scan output. Engineers review and merge manually.
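The four elements of the supervised PR body can be templated so every fix PR looks the same to reviewers. A minimal sketch; the section titles and argument names are illustrative:

```python
def render_pr_body(finding, rationale, test_output, rescan_summary):
    """Render a supervised-PR description with the four sections named
    above. All arguments are hypothetical strings the pipeline already
    has on hand."""
    return "\n\n".join([
        "## Original finding\n" + finding,
        "## Fix rationale\n" + rationale,
        "## Test results\n" + test_output,
        "## Re-scan output\n" + rescan_summary,
    ])
```

Consistent section ordering lets reviewers (and later tooling) find the re-scan evidence without reading the whole description.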
### Pattern 3: Auto-merge for approved classes (highest velocity)
PRs merge automatically, but only for vulnerability classes with >90% historical correctness in your codebase; all other classes still require approval. Enable this only after measuring a baseline for at least 30 days.
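The gating rule for Pattern 3 is easy to state precisely. A sketch, assuming a hypothetical history of `(date, finding_class, was_correct)` records collected during the baseline period:

```python
from datetime import date, timedelta

def auto_merge_allowed(finding_class, history, today=None,
                       min_days=30, threshold=0.90):
    """Gate for Pattern 3: allow auto-merge only when a class shows >90%
    measured correctness over at least 30 days of baseline data.
    `history` is a hypothetical list of (date, class, was_correct) tuples."""
    today = today or date.today()
    records = [r for r in history if r[1] == finding_class]
    if not records:
        return False  # no data: never auto-merge
    oldest = min(r[0] for r in records)
    if (today - oldest) < timedelta(days=min_days):
        return False  # baseline measurement window not yet satisfied
    correct = sum(1 for r in records if r[2])
    return correct / len(records) > threshold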
## Common Failure Modes
- The fix-and-reintroduce pattern: parameterises the flagged query but misses a second identical pattern two functions below.
- The API hallucination: uses `secure_random()` from a library that does not exist in your version.
- The behaviour change: changes null-handling behaviour while fixing the injection, breaking downstream callers silently.
- The test bypass: adds `@pytest.mark.skip` to the test that would catch the remaining vulnerability.
- The partial fix: fixes the SAST finding but leaves the same pattern in a related function that the scanner did not flag.
The most dangerous failure mode is the partial fix. It closes the finding in your tracker, reduces your vulnerability count metric, and leaves an identical exploitable pattern one function away.
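A cheap guard against the partial fix is to grep the rest of the file for the same vulnerable pattern after the fix merges. A sketch; the regex below is an illustrative example for f-string-built SQL, not a full SAST rule:

```python
import re

# Illustrative pattern: f-string SQL passed straight to execute().
SQL_FSTRING = r'execute\(f["\']'

def find_sibling_patterns(source, pattern):
    """Return (line_number, line) pairs still matching the vulnerable
    pattern, so a fix PR can be flagged as partial before the finding
    is closed in the tracker."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(pattern, line):
            hits.append((lineno, line.strip()))
    return hits
```

Running this on the whole changed file (or module) after a fix catches the "identical pattern one function away" case the scanner and the model both missed.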