May 5, 2026 · 5 min read
A few weeks back I read a piece on The AI Signal arguing that AI should not review its own code. The case was simple: the model that wrote the work is the worst possible critic of it. Wrong incentives. Confirmation bias. Tunnel vision.
It was Claude’s own argument, written about Claude. And reading it, I realised I was doing exactly the thing the article warned against. Claude wrote. Claude reviewed. Claude shipped. The loop closed on itself.
So I added a second model.
WHAT THE OUTSIDE REVIEWER ACTUALLY IS
It runs on the desktop next to me, not on the server. The current pick is a 35B Qwen MoE fine-tune.¹ Smaller than Claude. Faster than Claude. And crucially: arriving at every decision with no shared history.
It has not read the planning conversation. It has not built rapport with the agent doing the work. It has not invested in the plan. It opens a fresh terminal window in its head and reads what is actually in front of it.
That is the whole trick.
WHY A SMALLER MODEL WORKS
I expected the local model to be a courtesy reviewer — a sanity check, nothing more. The frontier model would do the real thinking; the local one would just confirm.
Wrong, both directions.
The smaller model genuinely catches things Claude misses. Not cosmetic edge-cases — actually critical things. A missing transaction boundary. A race condition in a cron path. An assumption that is true today but breaks the moment the data shape changes.
The reason is not that Qwen is smarter. It is not. It is that it does not carry Claude’s context. Claude is the strongest model in the room — but it also carries the most history, the most plan-investment, the most narrative momentum behind every decision. A reviewer with none of that baggage reads the diff, not the story.
The smaller model cannot be flattered out of finding a problem. It has not been told why this is the right approach. So when something is not right, it just says so.
¬ Key move
THE PROMPT IS THE LEVERAGE POINT — AND THE PART MOST PEOPLE MISS
This is the section I would ask you to remember if you read nothing else. Everything I tried before this point was incremental. This is the move that made the reviewer go from “nice to have” to actually load-bearing.
The first version of the setup used a generic reviewer prompt. “You are a senior reviewer. Find issues.” Standard. Reusable. Lazy.
It worked, but it was diffuse. The reviewer flagged three things — one critical, two cosmetic.
Then I tried something else. Before each review, I asked Claude to write the reviewer’s system prompt itself, tailored to the specific decision in front of it. Not “find issues” but something like:
We are shipping a new module that POSTs article bodies to a TTS API and atomic-writes the resulting MP3 into the live web root. Already verified: the API contract, the atomic-write pattern, the cost budget. Out of scope this pass: error message wording, log format. Risk surface: race conditions when two articles publish in the same minute, partial writes if the disk fills mid-render, secret leakage in any error path. Push back hardest on those.
Same model, same code, suddenly went sharp. It zeroed in on the load-bearing assumption and pushed back on it. The “two cosmetic things” disappeared because the brief had told the reviewer not to bother with cosmetics on this pass. The one critical thing got a deeper, harder critique.
The lesson: a generic reviewer is a generic skeptic. A briefed reviewer is a focused one. Do not reuse the same system prompt across tasks. Let the agent doing the work write its own critic’s brief — it knows what has already been considered, where the actual risk lives, and what is out of scope. Use that knowledge instead of throwing it away every time.
This single change did more for review quality than swapping models ever did.
THE INFRASTRUCTURE IS BORING
I expected this to need real infrastructure. It did not.
LM Studio runs on my laptop with the model loaded. It exposes an OpenAI-compatible endpoint on localhost. From the server, I open an SSH tunnel back to my laptop, and Claude on the server can call my local model exactly like any other API.

That is it. No GPU on the server. No model hosting. No exotic plumbing. The setup takes about thirty seconds when I sit down at my desk, and it tears down when I close the laptop.
There is a lovely property in this: the reviewer is only available when I am there. If I am not at my desk, no reviewer. Which means the only decisions the reviewer touches are decisions I am awake for. That is not a limitation — that is a feature. Critical reviews should not happen at 3am while I am asleep. They should happen with me in the loop.
WHEN THE REVIEWER GETS CALLED
Not for everything. Most work does not need a second opinion — Claude writes, ships, moves on. The local reviewer comes in for the things that would hurt to get wrong:
- Locked-in API contracts before they go live
- Security-sensitive changes — anything touching secrets, auth, or untrusted input
- Non-trivial migrations where rollback is expensive
- Architecture decisions I will be living with for months
Maybe one call in twenty. But for those calls, a second viewpoint — one with no stake in Claude’s plan — has changed the failure mode of the whole pipeline.
WHAT I’D PASS ON
Three things, in order of how much they matter:
- The outside view is the whole point. Do not run two instances of the same model and call it review. The value is not in the second pass — it is in the separate context. Different model. Different prompt. Different head.
- Brief the reviewer per task. A generic “find issues” prompt produces generic feedback. Let the strong model write the small model’s brief. This is the multiplier.
- Smaller is fine. A frontier-class reviewer is not required. A fresh one is. The local Qwen catches things precisely because it has not already convinced itself the plan is good.
ONE MORE THING
The article that started this argued AI should not review its own code. I would extend it now: AI should not review its own decisions. Not architecture. Not API design. Not the call to ship.
The bias is not only in code review. It is anywhere the same head is doing the work and grading it. A second model in the loop is the cheapest correction I have found — and it runs on the laptop I already had.
The reviewer is the smallest model in my stack. It is also the one I would add first.
¹ For anyone wanting to replicate exactly: the current model is qwen3.6-35b-a3b-uncensored-wasserstein in LM Studio. Any reasonably capable 30B+ instruct model with a separate context will do — the fresh head matters more than the specific weights.