# Model Comparison: GPT-4o vs Claude vs Gemini
We evaluated three frontier models on a set of 200 real-world vulnerability findings across five categories. Each model was given identical prompts with full function context.
| Model | Overall Correctness | False-Fix Rate | Context Window | Best For |
|---|---|---|---|---|
| GPT-4o | 78% | 8% | 128k tokens | Injection fixes, headers |
| Claude 3.5 Sonnet | 82% | 6% | 200k tokens | Large file refactors, logic |
| Gemini 1.5 Pro | 71% | 12% | 1M tokens | Dependency analysis |
Note: "Correctness" means the fix resolves the original vulnerability without introducing new findings. "False-fix rate" means the fix passes tests and scanner re-check but the vulnerability is still exploitable via a different path.
## Correctness by Vulnerability Class
The class of vulnerability matters more than the model choice when predicting fix quality:
| Vulnerability Class | Best Model Correctness | Auto-Merge Safe? |
|---|---|---|
| Dependency version bump (no API change) | 97% | Yes |
| Missing security headers | 94% | Yes |
| Hardcoded secret → env var | 91% | Review |
| SQL injection (string → parameterised) | 85% | Review |
| TLS `verify=False` removal | 88% | Yes |
| XSS output encoding | 72% | No |
| Path traversal | 68% | No |
| Auth/AuthZ flaws | 31% | No |
| Business logic | 18% | No |
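The auto-merge column above can be encoded directly as routing policy. A minimal sketch, assuming a hypothetical class taxonomy and tier names (your scanner's labels will differ):

```python
# Hypothetical policy map derived from the correctness table above; the class
# keys and tier names are illustrative, not a real scanner's taxonomy.
MERGE_POLICY = {
    "dependency_bump": "auto",      # 97% correct, auto-merge safe
    "missing_headers": "auto",      # 94%
    "hardcoded_secret": "review",   # 91%
    "tls_verify_removal": "auto",   # 88%
    "sql_injection": "review",      # 85%
    "xss_encoding": "manual",       # 72%
    "path_traversal": "manual",     # 68%
    "authz_flaw": "manual",         # 31%
    "business_logic": "manual",     # 18%
}

def merge_tier(finding_class: str) -> str:
    # Unknown classes fall back to manual review, the safest default.
    return MERGE_POLICY.get(finding_class, "manual")
```

Defaulting unknown classes to manual review means a new scanner rule can never silently enter the auto-merge path.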
## Prompt Engineering for Fix Quality
Context quality is the largest lever. For the same vulnerability class, the gap between a 55% and an 85% correctness rate is almost entirely explained by the quality of the context supplied in the prompt:
- Include the full function, not just the flagged line
- Include the import section: the model needs to know which libraries are available
- Include the test file: the model will try to maintain test compatibility
- Explicitly state what must not change: "Do not change the function signature", "Do not modify the return type"
- Request unified diff format only: prose responses lead to incorrect patch application
- Use few-shot examples of good fixes for the same vulnerability class in your codebase
## Verification Checklist Before Merging
Every AI-generated security fix PR should pass this checklist before merge:
- [ ] Diff is minimal: changes only what is needed to fix the vulnerability
- [ ] Re-scan with the original scanner confirms the finding is resolved
- [ ] No new scanner findings introduced by the fix
- [ ] Existing test suite passes unchanged
- [ ] No new test files added by the AI (a test-gaming indicator)
- [ ] No hallucinated imports or library functions
- [ ] Function signature unchanged (unless the signature is the fix)
- [ ] Security team has reviewed, if the class is in the "review required" tier
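Several checklist items are mechanical and can run in CI before a human ever looks at the PR. A sketch, assuming hypothetical scanner finding IDs and a changed-file list from your pipeline; diff minimality and hallucinated APIs still need human eyes:

```python
def verify_fix_pr(changed_files, baseline_ids, rescan_ids, finding_id,
                  tests_passed):
    """Evaluate the mechanical items of the merge checklist. IDs and file
    lists are hypothetical outputs of your scanner and CI."""
    failures = []
    # Re-scan must show the original finding resolved.
    if finding_id in rescan_ids:
        failures.append("original finding still present after re-scan")
    # No findings in the re-scan that were absent from the baseline.
    if set(rescan_ids) - set(baseline_ids):
        failures.append("fix introduced new scanner findings")
    if not tests_passed:
        failures.append("existing test suite did not pass unchanged")
    # New test files in the diff are a test-gaming indicator.
    new_tests = [f for f in changed_files
                 if f.rsplit("/", 1)[-1].startswith("test_")]
    if new_tests:
        failures.append("AI added test files: " + ", ".join(new_tests))
    return failures  # empty list means the mechanical checks pass
```

An empty return value is a necessary condition for merge, not a sufficient one.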
## Integration Patterns
Three common integration patterns, from simplest to most autonomous:
### Pattern 1: Suggestion-only (lowest risk)
The LLM generates a fix suggestion in a comment on the scanner finding. Engineers copy-paste what they want. No automated PRs, no risk of bad code entering the repo automatically.
### Pattern 2: Supervised PR (recommended)
The LLM generates a full PR that requires human approval. The PR includes the original finding, the fix rationale, test results, and re-scan output. Engineers review and merge manually.
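The four elements of the supervised PR body can be templated so every fix PR looks the same to reviewers. A minimal sketch; the section titles and argument names are illustrative:

```python
def render_pr_body(finding, rationale, test_output, rescan_summary):
    """Render a supervised-PR description with the four sections named
    above. All arguments are hypothetical strings the pipeline already
    has on hand."""
    return "\n\n".join([
        "## Original finding\n" + finding,
        "## Fix rationale\n" + rationale,
        "## Test results\n" + test_output,
        "## Re-scan output\n" + rescan_summary,
    ])
```

Consistent section ordering lets reviewers (and later tooling) find the re-scan evidence without reading the whole description.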
### Pattern 3: Auto-merge for approved classes (highest velocity)
PRs merge automatically, but only for vulnerability classes with >90% historical correctness in your codebase; all other classes still require approval. Enable this only after measuring a baseline for at least 30 days.
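The gating rule for Pattern 3 is easy to state precisely. A sketch, assuming a hypothetical history of `(date, finding_class, was_correct)` records collected during the baseline period:

```python
from datetime import date, timedelta

def auto_merge_allowed(finding_class, history, today=None,
                       min_days=30, threshold=0.90):
    """Gate for Pattern 3: allow auto-merge only when a class shows >90%
    measured correctness over at least 30 days of baseline data.
    `history` is a hypothetical list of (date, class, was_correct) tuples."""
    today = today or date.today()
    records = [r for r in history if r[1] == finding_class]
    if not records:
        return False  # no data: never auto-merge
    oldest = min(r[0] for r in records)
    if (today - oldest) < timedelta(days=min_days):
        return False  # baseline measurement window not yet satisfied
    correct = sum(1 for r in records if r[2])
    return correct / len(records) > threshold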
## Common Failure Modes
- The fix-and-reintroduce pattern: parameterises the flagged query but misses a second identical pattern two functions below.
- The API hallucination: uses `secure_random()` from a library that does not exist in your version.
- The behaviour change: changes null-handling behaviour while fixing the injection, breaking downstream callers silently.
- The test bypass: adds `@pytest.mark.skip` to the test that would catch the remaining vulnerability.
- The partial fix: fixes the SAST finding but leaves the same pattern in a related function that the scanner did not flag.
The most dangerous failure mode is the partial fix. It closes the finding in your tracker, reduces your vulnerability count metric, and leaves an identical exploitable pattern one function away.
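A cheap guard against the partial fix is to grep the rest of the file for the same vulnerable pattern after the fix merges. A sketch; the regex below is an illustrative example for f-string-built SQL, not a full SAST rule:

```python
import re

# Illustrative pattern: f-string SQL passed straight to execute().
SQL_FSTRING = r'execute\(f["\']'

def find_sibling_patterns(source, pattern):
    """Return (line_number, line) pairs still matching the vulnerable
    pattern, so a fix PR can be flagged as partial before the finding
    is closed in the tracker."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if re.search(pattern, line):
            hits.append((lineno, line.strip()))
    return hits
```

Running this on the whole changed file (or module) after a fix catches the "identical pattern one function away" case the scanner and the model both missed.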