Who Reviews the Code When AI Writes It?

Review is the new bottleneck

Every engineering pipeline has a constraint, and for fifty years the constraint was writing the code. Tooling, methodology, and hiring all optimized for production speed. Then production speed went effectively infinite, and the constraint moved — to verification. The team that used to merge ten pull requests a week now faces sixty, each one syntactically clean, plausibly documented, and written by something that does not get embarrassed when it's wrong.

The instinctive response, "just review harder," fails basic arithmetic. If your five seniors each have two hours of deep review attention per day, that is a fixed verification budget of ten hours a day. AI multiplied the demand on that budget several times over (sixfold in the example above) while leaving the supply unchanged. Something has to give, and in most teams what gives is quietly catastrophic: review depth.
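The budget arithmetic is easy to make concrete. A minimal sketch, using the five reviewers and two hours a day from the paragraph above and the ten-to-sixty pull request jump from the introduction; the five-day week and the function name `minutes_per_pr` are assumptions added for illustration.

```python
# Fixed review supply vs. growing demand. Reviewer count, hours, and PR
# volumes come from the text; the five-day week is an assumption.

def minutes_per_pr(reviewers: int, hours_per_day: float,
                   days_per_week: int, prs_per_week: int) -> float:
    """Deep-review minutes available for each pull request."""
    budget_minutes = reviewers * hours_per_day * days_per_week * 60
    return budget_minutes / prs_per_week

before = minutes_per_pr(5, 2, 5, 10)  # 3000 minutes / 10 PRs
after = minutes_per_pr(5, 2, 5, 60)   # 3000 minutes / 60 PRs

print(f"before: {before:.0f} min per PR")  # before: 300 min per PR
print(f"after:  {after:.0f} min per PR")   # after:  50 min per PR
```

The supply side never changes; only the divisor does. Under these assumptions, each pull request gets a sixth of the attention it used to.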

The inversion in one sentence: when code is written by humans, review is quality control on a scarce artifact. When code is written by AI, review is the engineering — the code is just the proposal.

The rubber-stamp epidemic

Studies of human code review have long reported that defect detection drops off sharply once a review passes roughly 400 lines of diff: attention saturates and reviewers start skimming. AI-generated pull requests routinely exceed that in a single commit, and they're fluent: consistent naming, tidy comments, plausible tests. Fluency is precisely what disarms a skimming reviewer. The diff doesn't look like it needs scrutiny, so it doesn't get any.

The result is a new and dangerous failure mode: the approved-but-unread change. It manifests as:

Missing authorization checks on the one endpoint out of forty where it mattered — the classic AI-code vulnerability, invisible in a skim because every endpoint looks the same.
Hallucinated dependencies — a package name that doesn't exist (until an attacker registers it) sailing through because nobody manually checks package.json additions anymore.
Tests that test nothing — generated assertions that mirror the implementation's own assumptions, giving green checkmarks without verification.
Subtle semantic drift — the agent "fixed" the bug by changing behavior some other system depended on.
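Of these, the hallucinated dependency is the easiest to catch mechanically. A minimal sketch, assuming the base and head versions of package.json are available as strings and that the team keeps an allowlist of vetted packages. The helper names, the allowlist, and the package `expres-session-utils` are all hypothetical, made up for illustration.

```python
import json

# Sections of package.json that can introduce a new package.
DEP_SECTIONS = ("dependencies", "devDependencies",
                "optionalDependencies", "peerDependencies")

def declared_packages(package_json_text: str) -> set[str]:
    """Collect every package name declared in any dependency section."""
    manifest = json.loads(package_json_text)
    names: set[str] = set()
    for section in DEP_SECTIONS:
        names.update(manifest.get(section, {}))
    return names

def new_dependencies(base_text: str, head_text: str) -> set[str]:
    """Packages present in the proposed manifest but not in the base."""
    return declared_packages(head_text) - declared_packages(base_text)

def unvetted(added: set[str], allowlist: set[str]) -> list[str]:
    """Newly added packages that nobody has vetted yet."""
    return sorted(added - allowlist)

base = '{"dependencies": {"express": "^4.19.0"}}'
head = ('{"dependencies": {"express": "^4.19.0", '
        '"expres-session-utils": "^1.0.0"}}')  # hypothetical package name

added = new_dependencies(base, head)
print(unvetted(added, allowlist={"express", "lodash"}))
# ['expres-session-utils']
```

A non-empty result would block the merge until a person confirms the package exists, is the one intended, and is maintained by who they think it is.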

The uncomfortable truth: a review process where humans approve diffs they haven't deeply read is worse than no review at all — it produces unverified code plus a false audit trail saying it was verified.

Can AI review AI? Mostly — with one catch

The obvious answer to machine-speed production is machine-speed review, and it largely works. An AI reviewer doesn't fatigue at line 400, applies the same scrutiny to the sixtieth PR as the first, and catches the mechanical majority: injection patterns, missing input validation, secret leakage, error-handling gaps, deviation from codebase conventions. Pairing a generator with an independent critic measurably raises quality.

The catch is correlation. If the reviewing model shares training lineage, context, or (worst case) the same conversation with the generating model, their blind spots overlap. A reviewer that thinks like the author confirms the author's mistakes. The engineering response is the same one used everywhere reliability matters: diversity and independence. That means different models; different analysis techniques, pairing LLM reasoning with deterministic SAST, SCA, and secret scanning, whose failure modes are largely uncorrelated with a language model's; and a reviewer that sees only the diff and the codebase, never the generator's self-justification.
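Independence is largely a matter of what the reviewer is allowed to see. A minimal sketch, assuming a pull request record that carries both the diff and the generator's own notes; the types, field names, and model identifiers are hypothetical. Note that a different model name is only a proxy for independence and does not by itself prove different training lineage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PullRequest:
    diff: str
    touched_files: dict[str, str]  # path -> file content from the codebase
    generator_model: str
    generator_notes: str           # the author's self-justification

@dataclass(frozen=True)
class ReviewInput:
    """Everything the reviewer may see. Deliberately has no notes field."""
    diff: str
    touched_files: dict[str, str]

def build_review_input(pr: PullRequest) -> ReviewInput:
    # Copy only the diff and codebase context; drop the generator's notes.
    return ReviewInput(diff=pr.diff, touched_files=dict(pr.touched_files))

def pick_reviewer(generator_model: str, available: list[str]) -> str:
    """Return the first available model that is not the generator."""
    for model in available:
        if model != generator_model:
            return model
    raise ValueError("no independent reviewer model available")

pr = PullRequest(
    diff="+ return user.is_admin",
    touched_files={"app/auth.py": "def check(user): ..."},
    generator_model="model-a",
    generator_notes="This change is safe because ...",
)
review_input = build_review_input(pr)
reviewer = pick_reviewer(pr.generator_model, ["model-a", "model-b"])
print(reviewer)  # model-b
```

Leaving the notes out of the type, rather than filtering them at call time, means a later refactor cannot leak them to the reviewer by accident.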

Also remember: the reviewer is attackable. Adversarial code comments and commit messages crafted to convince an AI reviewer that malicious code is benign are a documented technique. Deterministic scanners don't read persuasion — that's exactly why they stay in the stack.

The layered review model

Teams handling AI-volume code converge on the same architecture — review as a pipeline of progressively more expensive layers, with risk-based routing deciding how far each change must climb:

Layer 0 — Deterministic gates. SAST, SCA, secret detection, malware scanning, license checks, build and test. Non-negotiable, blocking, on every commit. Cheap, fast, immune to persuasion.
Layer 1 — AI review. An independent model critiques logic, security, and conventions; auto-triage separates real findings from noise so humans never see the false-positive flood.
Layer 2 — Risk routing. A classifier (or simple rules) scores the change: files touched, paths affected (auth? payments? crypto? CI config?), blast radius, reversibility. Low-risk changes merge on green layers 0–1.
Layer 3 — Human deep review. Reserved for what actually needs human judgment — and given real time, because layers 0–2 cut the volume by 80–90%.
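The routing between layers can start as a handful of rules. A minimal sketch, assuming path globs stand in for risk and that a change with confirmed AI findings goes back to its author; the glob list, the `Change` fields, and the outcome labels are assumptions for illustration, not a prescribed scheme.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

# Assumed patterns for sensitive paths; a real team would maintain its own.
HIGH_RISK_GLOBS = ("*auth*", "*payment*", "*crypto*", ".github/workflows/*")

@dataclass
class Change:
    paths: list[str]
    gates_passed: bool      # Layer 0: deterministic gates all green
    ai_findings: int        # Layer 1: confirmed findings after triage
    reversible: bool = True

def is_high_risk(change: Change) -> bool:
    """Layer 2: sensitive paths or irreversibility raise the risk."""
    touches_sensitive = any(
        fnmatchcase(path, glob)
        for path in change.paths
        for glob in HIGH_RISK_GLOBS
    )
    return touches_sensitive or not change.reversible

def route(change: Change) -> str:
    if not change.gates_passed:
        return "blocked"          # Layer 0 is non-negotiable
    if change.ai_findings > 0:
        return "back-to-author"   # assumption: fix findings before routing
    if is_high_risk(change):
        return "human-review"     # Layer 3
    return "auto-merge"           # low risk, green on layers 0 and 1

print(route(Change(["docs/readme.md"], True, 0)))       # auto-merge
print(route(Change(["src/auth/session.py"], True, 0)))  # human-review
```

Substring globs like these over-match (a file named `author_guide.md` would count as auth), which errs toward more human review rather than less.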

The point of the layers isn't to remove humans. It's to stop spending irreplaceable human attention on what machines verify better, so the attention is available where machines verify worst.

What humans should still read — every time

Trust-boundary changes: anything touching authentication, authorization, session handling, or tenant isolation.
New dependencies and CI/CD changes: the supply chain is where one bad merge becomes everyone's incident.
Irreversible operations: migrations, data deletion, key rotation, public API contracts.
Architectural deviations: a change that introduces a new pattern, store, or communication path — the entropy events that decide what the codebase becomes.
Anything the AI reviewer flagged as uncertain. Calibrated uncertainty is the most valuable signal in the pipeline; route it to people, always.
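The list above can double as an explicit check that records why a change was escalated. A minimal sketch, assuming the AI reviewer reports a confidence between 0 and 1 and that earlier stages have already set the boolean signals; the field names and the 0.8 floor are hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune per team

@dataclass
class ChangeSignals:
    touches_trust_boundary: bool = False
    touches_supply_chain: bool = False     # new dependency or CI/CD change
    irreversible: bool = False
    architectural_deviation: bool = False
    reviewer_confidence: float = 1.0       # as reported by the AI reviewer

def human_review_reasons(s: ChangeSignals) -> list[str]:
    """Return every reason this change must be read by a person."""
    checks = [
        (s.touches_trust_boundary, "trust boundary"),
        (s.touches_supply_chain, "dependency or CI/CD change"),
        (s.irreversible, "irreversible operation"),
        (s.architectural_deviation, "architectural deviation"),
        (s.reviewer_confidence < CONFIDENCE_FLOOR, "AI reviewer uncertain"),
    ]
    return [reason for triggered, reason in checks if triggered]

signals = ChangeSignals(irreversible=True, reviewer_confidence=0.55)
print(human_review_reasons(signals))
# ['irreversible operation', 'AI reviewer uncertain']
```

Returning the reasons rather than a bare yes or no gives the human reviewer a starting point and leaves a record of why the change was escalated.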

A useful cultural rule: the approver owns the change. Not the agent, not the vendor, not the prompt author. If you wouldn't put your name on it, don't click approve — escalate it. That single norm keeps the audit trail honest.

Takeaways

So who reviews the code when AI writes it? Machines review the volume; humans review the risk. The teams in trouble are the ones still pretending a human reads every line — they get rubber stamps and a fictional audit trail. The teams that thrive rebuilt review as infrastructure: deterministic security gates on every commit, independent AI critique, risk-based routing, and human judgment concentrated on the handful of changes where it's genuinely irreplaceable.

Code review isn't dying. It's being promoted — from a courtesy between colleagues to the load-bearing wall of software quality.

Layers 0 and 1, handled

AquilaX runs SAST, secret detection, SCA, IaC and malware scanning with AI triage on every commit — so your humans only review what truly needs them.

