What is the Review model?
Security scanners are only as useful as their signal-to-noise ratio. Most SAST and SCA tools generate enormous volumes of findings, and the majority are false positives. Engineers who wade through hundreds of irrelevant alerts quickly lose trust in the scanner entirely.
The Review model solves this at the model layer. It sits between the raw scanner output and the AquilaX dashboard, classifying every finding as a true positive (TP) or false positive (FP) before you ever see it. The result: 93.54% of false positives are eliminated automatically, leaving engineers with a focused queue of real findings that actually need attention.
Model ID: AquilaX-AI/Review (available on HuggingFace). Base architecture: microsoft/graphcodebert-base. Task: binary sequence classification (TP/FP).
Performance metrics.
The Review model is evaluated on a held-out test set (20% of the training corpus, stratified by CWE category). On that set it automatically eliminates 93.54% of false positives.
High recall is deliberately prioritised over precision: it is better to surface a real vulnerability that requires human review than to suppress it entirely. The model is calibrated to minimise false negatives (missed real vulnerabilities) while aggressively removing noise.
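One common way to bias a binary classifier towards recall is to replace the argmax decision with a lowered probability threshold for the positive class. The sketch below illustrates that idea in isolation; the threshold value and the decision rule are assumptions for illustration, not the calibration actually used by the Review model.

```python
import math

def classify(logits, tp_threshold=0.35):
    """Recall-favouring decision rule (illustrative sketch).

    Rather than taking the argmax of the two logits, a finding is kept
    as a true positive whenever its TP probability clears a low,
    assumed threshold, so borderline findings are surfaced for human
    review instead of being suppressed.
    """
    # Softmax over the two logits: index 0 = FP, index 1 = TP
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    p_tp = exps[1] / sum(exps)
    return "TRUE_POSITIVE" if p_tp >= tp_threshold else "FALSE_POSITIVE"

# A borderline finding (p_tp ~ 0.45) is still surfaced,
# even though argmax alone would label it FP:
print(classify([0.1, -0.1]))   # TRUE_POSITIVE
# A confidently-classified false positive is suppressed:
print(classify([2.0, -2.0]))   # FALSE_POSITIVE
```

Lowering the TP threshold trades a few extra false positives in the queue for fewer missed real vulnerabilities, which matches the stated calibration goal.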
Training data.
The Review model is trained entirely on real AquilaX scan data: findings reviewed and labelled by security engineers and the Securitron AI feedback loop. Training examples are drawn from the AquilaX PostgreSQL database, with the following schema per record:
| Field | Type | Description |
|---|---|---|
| cwe_id | integer | CWE identifier for the vulnerability class (e.g. CWE-89 for SQL injection) |
| cwe_name | string | Human-readable vulnerability name (e.g. "SQL Injection") |
| affected_line | integer | Line number in the source file where the vulnerability was detected |
| partial_code | string | Code snippet surrounding the finding (context window: ±10 lines) |
| file_name | string | Source file path (used for context, not as a signal) |
| org_id | string | Organisation identifier (used for stratified splitting, not as a feature) |
| status | enum | Ground truth label: true_positive or false_positive |
Training data is split 80/20 (train/test) with stratification by CWE category to ensure balanced class representation across all vulnerability types. Labels are assigned by AquilaX security engineers and validated through the Securitron feedback loop.
Privacy note: Training data is drawn exclusively from anonymised scan results. No proprietary source code leaves the AquilaX infrastructure. org_id and file_name are used only for data splitting and are not fed to the model as input features.
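The stratified 80/20 split described above can be sketched as follows. This is an illustrative implementation, not the actual AquilaX pipeline code; the field names match the schema table, and the grouping key (cwe_id) and seed are assumptions.

```python
import random

def stratified_split(records, test_frac=0.2, seed=42):
    """80/20 train/test split, stratified by CWE category (sketch).

    Records are grouped by cwe_id, each group is shuffled, and the
    same test fraction is carved off every group, so every
    vulnerability class is represented in both splits.
    """
    rng = random.Random(seed)
    by_cwe = {}
    for rec in records:
        by_cwe.setdefault(rec["cwe_id"], []).append(rec)
    train, test = [], []
    for group in by_cwe.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Two CWE classes with 8 labelled findings each:
records = ([{"cwe_id": 89, "status": "true_positive"}] * 8
           + [{"cwe_id": 79, "status": "false_positive"}] * 8)
train, test = stratified_split(records)
print(len(train), len(test))  # 12 4 -- 80/20 within each CWE group
```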
Architecture.
GraphCodeBERT (microsoft/graphcodebert-base) is a pre-trained code representation model that understands both the token-level semantics and the data-flow graph structure of source code. This makes it substantially more effective at classifying security findings than general-purpose language models, which treat code as plain text.
Key architectural decisions:
- Base model: microsoft/graphcodebert-base (12 transformer layers, 768 hidden dimensions, 125M parameters)
- Task head: Binary sequence classification (TP / FP)
- Input encoding: CWE name + affected code snippet tokenised together with a separator token, allowing the model to reason about the vulnerability class in the context of the specific code pattern
- Training duration: 15 epochs on an NVIDIA RTX 4090; total training time under 10 minutes
- Loss function: Cross-entropy with class-weighted sampling to compensate for the natural imbalance between TP and FP labels
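The class-weighting idea behind the loss function can be shown for a single example. The weight values below are hypothetical (the actual training weights are not published here); the sketch only demonstrates how a larger weight on the minority class makes its misclassifications cost more.

```python
import math

def weighted_cross_entropy(logits, label, class_weights):
    """Class-weighted cross-entropy for one example (illustrative).

    class_weights compensates for TP/FP imbalance: the minority class
    gets a larger weight so errors on it contribute more to the loss.
    """
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p = logits[label] - log_z
    return -class_weights[label] * log_p

# Assumed ~3:1 FP:TP imbalance, so TPs (label 1) are weighted 3x:
weights = {0: 1.0, 1: 3.0}
loss_fp = weighted_cross_entropy([2.0, 0.0], label=0, class_weights=weights)
loss_tp = weighted_cross_entropy([2.0, 0.0], label=1, class_weights=weights)
print(round(loss_fp, 3), round(loss_tp, 3))  # the misclassified TP costs far more
```

This is the same effect as passing a `weight` tensor to a framework cross-entropy loss, or oversampling the minority class in the data loader.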
Running inference.
The Review model can be loaded from HuggingFace and run locally. The expected input combines the CWE name and the affected code snippet:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokeniser from HuggingFace
model_id = "AquilaX-AI/Review"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Example finding: SQL injection in a Python file
cwe_name = "SQL Injection"
partial_code = """
query = "SELECT * FROM users WHERE id = " + user_input
cursor.execute(query)
"""

# Encode input: [CWE_NAME] [SEP] [CODE_SNIPPET]
inputs = tokenizer(
    cwe_name,
    partial_code,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=-1).item()
label = "TRUE_POSITIVE" if prediction == 1 else "FALSE_POSITIVE"
print(f"Classification: {label}")
```
The model outputs a binary label per finding. In the AquilaX pipeline, findings classified as false positives are automatically suppressed before reaching the dashboard. True positives are passed through to the findings queue, enriched with fix patches from the Security Assistant model.
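The routing step can be sketched independently of the model itself. Here `classify` is a stand-in for the model call shown above, and the function is an assumption about the pipeline's behaviour, not the actual AquilaX implementation.

```python
def route_findings(findings, classify):
    """Route scanner findings through the Review classification step.

    Findings the classifier labels FALSE_POSITIVE are suppressed;
    the rest go to the engineer-facing findings queue.
    """
    queue, suppressed = [], []
    for finding in findings:
        label = classify(finding["cwe_name"], finding["partial_code"])
        (queue if label == "TRUE_POSITIVE" else suppressed).append(finding)
    return queue, suppressed

# Stub classifier for demonstration: flags SQL injection as real
def stub(cwe, code):
    return "TRUE_POSITIVE" if cwe == "SQL Injection" else "FALSE_POSITIVE"

findings = [
    {"cwe_name": "SQL Injection", "partial_code": "cursor.execute(q)"},
    {"cwe_name": "Hardcoded Secret", "partial_code": "key = 'demo'"},
]
queue, suppressed = route_findings(findings, stub)
print(len(queue), len(suppressed))  # 1 1
```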
Continuous retraining loop.
The Review model is retrained on a daily schedule using new labelled data from the AquilaX feedback system. When engineers mark a finding as a false positive or confirm a true positive in the AquilaX dashboard, those labels are automatically incorporated into the next training run.
This creates a self-improving feedback loop:
- Scanner detects a potential vulnerability in a repository
- Review model classifies it as TP or FP
- Finding is shown to the engineer (if TP) or suppressed (if FP)
- Engineer feedback (mark as FP, dismiss, confirm) is captured
- New labels are written to the training database
- Nightly retraining run incorporates all new labels
- Updated model weights are deployed to production
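The feedback-capture step of the loop above can be sketched as a mapping from dashboard actions to the `status` enum in the training schema. The action names and row shape are assumptions for illustration; only the two status values come from the schema table.

```python
from datetime import datetime, timezone

def record_feedback(training_rows, finding_id, engineer_action):
    """Capture engineer feedback as a new training label (sketch).

    Maps hypothetical dashboard actions to the status enum from the
    training schema; dismissals are treated as false positives here,
    which is an assumption about the product's semantics.
    """
    status = {
        "confirm": "true_positive",
        "mark_fp": "false_positive",
        "dismiss": "false_positive",
    }[engineer_action]
    training_rows.append({
        "finding_id": finding_id,
        "status": status,
        "labelled_at": datetime.now(timezone.utc).isoformat(),
    })
    return training_rows

rows = []
record_feedback(rows, "f-101", "mark_fp")
record_feedback(rows, "f-102", "confirm")
print([r["status"] for r in rows])  # ['false_positive', 'true_positive']
```

The nightly retraining run would then read these rows alongside the existing corpus and redo the stratified split before fine-tuning.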
"Every engineer interaction makes the model smarter. The system learns the specific false positive patterns that appear in your codebase, your language, your framework, and gets better at filtering them out over time."
As the number of labelled findings grows, the model's accuracy on organisation-specific code patterns improves. Enterprise customers with large scan volumes typically see measurably higher precision within the first 30 days of deployment.