Model Comparison: GPT-4o vs Claude vs Gemini

We evaluated three frontier models on a set of 200 real-world vulnerability findings across five categories. Each model was given identical prompts with full function context.

Model              Overall Correctness  False-Fix Rate  Context Window  Best For
GPT-4o             78%                  8%              128k tokens     Injection fixes, headers
Claude 3.5 Sonnet  82%                  6%              200k tokens     Large file refactors, logic
Gemini 1.5 Pro     71%                  12%             1M tokens       Dependency analysis

Note: "Correctness" means the fix resolves the original vulnerability without introducing new findings. "False-fix rate" means the fix passes tests and scanner re-check but the vulnerability is still exploitable via a different path.

Correctness by Vulnerability Class

The class of vulnerability matters more than the model choice when predicting fix quality:

Vulnerability Class                      Correctness  Auto-Merge Safe?
Dependency version bump (no API change)  97%          Yes
Missing security headers                 94%          Yes
Hardcoded secret → env var               91%          Review
SQL injection (string → parameterised)   85%          Review
TLS verify=False removal                 88%          Yes
XSS output encoding                      72%          No
Path traversal                           68%          No
Auth/AuthZ flaws                         31%          No
Business logic                           18%          No
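
The tiering in the table above can be encoded as a simple policy map so that pipeline code, rather than tribal knowledge, decides which fixes may auto-merge. This is a sketch with hypothetical class names; the tiers mirror the table, and anything unrecognised falls back to manual review.

```python
# Hypothetical policy map derived from the measured correctness tiers:
# "auto" = safe to auto-merge, "review" = security-team review required,
# "manual" = never auto-merge.
MERGE_POLICY = {
    "dependency_version_bump": "auto",      # 97% correct
    "missing_security_headers": "auto",     # 94%
    "hardcoded_secret_to_env": "review",    # 91%
    "sql_injection_parameterise": "review", # 85%
    "tls_verify_false_removal": "auto",     # 88%
    "xss_output_encoding": "manual",        # 72%
    "path_traversal": "manual",             # 68%
    "authz_flaw": "manual",                 # 31%
    "business_logic": "manual",             # 18%
}

def merge_tier(vuln_class: str) -> str:
    """Return the merge tier for a vulnerability class.

    Unknown classes default to "manual": the safe failure mode is
    requiring a human, never auto-merging by accident.
    """
    return MERGE_POLICY.get(vuln_class, "manual")
```

Defaulting unknown classes to `manual` matters: scanners add new rule IDs over time, and a new class should never inherit auto-merge behaviour silently.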

Prompt Engineering for Fix Quality

Context quality is the largest lever. For the same vulnerability class, the difference between 55% and 85% fix correctness is almost entirely explained by what context the model is given:

  • Include the full function, not just the flagged line
  • Include the import section: the model needs to know which libraries are available
  • Include the test file: the model will try to maintain test compatibility
  • Explicitly state what must not change: "Do not change the function signature", "Do not modify the return type"
  • Request unified diff format only: prose responses lead to incorrect patch application
  • Use few-shot examples of good fixes for the same vulnerability class in your codebase

Verification Checklist Before Merging

Every AI-generated security fix PR should pass this checklist before merge:

  • ✓ Diff is minimal: changes only what is needed to fix the vulnerability
  • ✓ Re-scan with original scanner confirms finding is resolved
  • ✓ No new scanner findings introduced by the fix
  • ✓ Existing test suite passes unchanged
  • ✓ No new test files added by the AI (test gaming indicator)
  • ✓ No hallucinated imports or library functions
  • ✓ Function signature unchanged (unless that is the fix)
  • ✓ Security team has reviewed if the class is in the "review required" tier
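
Several of these items are purely mechanical and can gate the PR before a human ever looks at it. The sketch below checks three of them against a git-style unified diff: new test files, added skip markers, and diff size. It is a heuristic, not a complete gate, and the 50-line threshold is an assumed limit you would tune per codebase.

```python
import re

def mechanical_checks(diff_text: str, max_added_lines: int = 50) -> list[str]:
    """Run the mechanical checklist items against a unified diff.

    Returns a list of failure reasons; an empty list means the
    mechanical checks pass. Heuristic sketch only.
    """
    failures = []
    # Lines added by the diff (skip the "+++" file header lines).
    added = [line[1:] for line in diff_text.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    # New files appear as "--- /dev/null" followed by "+++ b/<path>".
    new_files = re.findall(r"^--- /dev/null\n\+\+\+ b/(\S+)",
                           diff_text, flags=re.M)
    if any("test" in path.lower() for path in new_files):
        failures.append("new test file added (test gaming indicator)")
    if any("@pytest.mark.skip" in line for line in added):
        failures.append("test skip marker added")
    if len(added) > max_added_lines:
        failures.append(f"diff is not minimal (>{max_added_lines} added lines)")
    return failures
```

Re-scanning with the original scanner and verifying imports against the lockfile still need real tooling; this only catches the cheap-to-detect failures early.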

Integration Patterns

Three common integration patterns, from simplest to most autonomous:

Pattern 1: Suggestion-only (lowest risk)

The LLM generates a fix suggestion in a comment on the scanner finding. Engineers copy-paste what they want. No automated PRs, no risk of bad code entering the repo automatically.

Pattern 2: Supervised PR (recommended)

The LLM generates a full PR that requires human approval. The PR includes the original finding, the fix rationale, test results, and re-scan output. Engineers review and merge manually.

Pattern 3: Auto-merge for approved classes (highest velocity)

Auto-merge only for vulnerability classes with >90% historical correctness in your codebase. All other classes require approval. Requires baseline measurement over at least 30 days before enabling.
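
The eligibility rule for Pattern 3 can be computed directly from tracked outcomes. A minimal sketch, assuming you record a boolean per fix attempt (correct or not) keyed by vulnerability class; the 20-sample minimum is an assumed floor standing in for the 30-day baseline, not a prescribed value.

```python
def auto_merge_eligible(history: dict[str, list[bool]], vuln_class: str,
                        min_samples: int = 20,
                        threshold: float = 0.90) -> bool:
    """Decide auto-merge eligibility for a class from historical outcomes.

    `history` maps class -> list of outcomes (True = fix was correct).
    Requires a minimum baseline of samples and >90% correctness;
    without enough data, the answer is always no.
    """
    outcomes = history.get(vuln_class, [])
    if len(outcomes) < min_samples:
        return False  # insufficient baseline measurement
    return sum(outcomes) / len(outcomes) > threshold
```

Requiring the baseline inside the function, rather than as a separate process step, means a newly added class can never be auto-merged before it has earned the data.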

Common Failure Modes

  • The fix-and-reintroduce pattern: parameterises the flagged query but misses a second identical pattern two functions below.
  • The API hallucination: uses secure_random() from a library that does not exist in your version.
  • The behaviour change: changes null-handling behaviour while fixing the injection, breaking downstream callers silently.
  • The test bypass: adds @pytest.mark.skip to the test that would catch the remaining vulnerability.
  • The partial fix: fixes the SAST finding but leaves the same pattern in a related function that the scanner did not flag.
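
A cheap countermeasure for the fix-and-reintroduce and partial-fix modes is to re-grep the whole file for the vulnerable pattern after the fix is applied, not just the flagged line. A sketch, where `vuln_pattern` is a regex you would supply per vulnerability class:

```python
import re

def remaining_occurrences(source: str, vuln_pattern: str) -> list[int]:
    """After a fix is applied, scan the entire file for other lines still
    matching the vulnerable pattern. Returns 1-based line numbers.

    A non-empty result means the fix may be partial: the same pattern
    survives outside the flagged location.
    """
    return [lineno for lineno, line in
            enumerate(source.splitlines(), start=1)
            if re.search(vuln_pattern, line)]
```

This is deliberately dumber than the scanner: a plain regex over the file has no data-flow blind spots, so it catches the sibling occurrence the SAST tool never flagged.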

The most dangerous failure mode is the partial fix. It closes the finding in your tracker, reduces your vulnerability count metric, and leaves an identical exploitable pattern one function away.