The cheapest place to catch a bug is the pull request. The change is small, the author still has it in their head, and nothing has shipped. After merge, the cost only goes up: QA finds it, a user finds it, or it quietly breaks something three sprints later that nobody connects back to this change.

So the obvious question is: why do so many catchable bugs make it past the PR?

The answer is usually that nothing actually checks for them. Code review depends on a busy human noticing the right thing under deadline, and humans miss things. That is not because they are bad at the job. It is because "review this PR for accessibility and security and reusability and test coverage and correctness, thoroughly, in the ten minutes you have before your next meeting" is not a realistic ask.

On a multi-dev codebase I worked on, I set up agents to make it a realistic ask. Here is how it worked and what it changed.

The PR agent: every change checked against the things people skip

When a developer opened a pull request, an agent kicked off a series of focused sub-checks. Not one generic "review this code" pass, but a set of specific lenses, each looking for one category of problem:

Accessibility. Does this change respect the accessibility standards the app is held to (labels, contrast, focus order, the things that are easy to drop and hard to retrofit)?
Security. Does it introduce anything risky: an unsafe data path, a leaked secret, a permission that is too broad?
Component reusability. Is this reinventing something the codebase already has? Is it adding a one-off where a shared pattern exists?
Test coverage. Are the new paths tested, including the unhappy ones?

Each check ran as its own focused pass, because a single agent told to check everything at once does the same thing a single human does: it gets the loud, obvious stuff and quietly drops the rest. Splitting it into separate lenses, each with one job, is what makes the coverage real instead of nominal.
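To make the "one lens, one job" idea concrete, here is a minimal sketch in Python. It is not the setup from that project: the `Lens` class, the prompt wording, and the `run_agent` callable are all hypothetical stand-ins for whatever agent interface you actually use. The only point it illustrates is that each lens gets its own narrowly scoped pass over the same diff, and the results are gathered per lens.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    """One focused review concern (hypothetical structure, for illustration)."""
    name: str
    instruction: str


# The four lenses described above. The instruction wording is an assumption.
LENSES = [
    Lens("accessibility", "Check only accessibility: labels, contrast, focus order."),
    Lens("security", "Check only security: unsafe data paths, leaked secrets, overly broad permissions."),
    Lens("reusability", "Check only reuse: does this duplicate a shared component or pattern?"),
    Lens("test-coverage", "Check only tests: are the new paths covered, including the unhappy ones?"),
]

# Hypothetical agent interface: takes a prompt, returns a list of findings.
Agent = Callable[[str], list[str]]


def review_diff(diff: str, run_agent: Agent, lenses=LENSES) -> dict[str, list[str]]:
    """Run one separate pass per lens and collect the findings by lens name."""
    findings = {}
    for lens in lenses:
        prompt = f"{lens.instruction}\nIgnore every other concern.\n\nDiff:\n{diff}"
        findings[lens.name] = run_agent(prompt)
    return findings


def format_comment(findings: dict[str, list[str]]) -> str:
    """Render the findings as plain text suitable for a PR comment."""
    lines = []
    for name, items in findings.items():
        status = f"{len(items)} finding(s)" if items else "no findings"
        lines.append(f"[{name}] {status}")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)
```

Passing the agent in as a plain callable keeps the fan-out logic testable with a stub, and adding a fifth lens is a one-line change to the list rather than a longer prompt.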

The output landed on the PR before merge. If a developer's own review had missed something, it surfaced right there, while the change was still cheap to fix, instead of after it reached the main branch.

The point was never to replace the developer's review. It was to make sure that the things humans reliably skip under deadline got checked by something that does not get tired, rushed, or bored on the forty-fifth PR of the week.

PR descriptions that double as a record

A smaller piece, but one that paid off more than I expected: every merged PR produced a clean description of what the change actually did.

This sounds like housekeeping. It turned out to be infrastructure. When QA picked up a ticket, they had a real summary of what was done instead of reverse-engineering it from a diff. When someone six weeks later asked "when did this behavior change, and why," there was a readable answer attached to the exact change that did it. The description became a durable reference that anyone, dev or not, could use later.

A codebase accumulates a lot of "why is it like this?" over time. Most of that pain is just lost context. Making each change explain itself at merge time is a cheap way to stop losing it.
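As a rough illustration of what "a change that explains itself" could look like in practice, the sketch below models a description as a small record and renders it to plain text. The field names (`what_changed`, `why`, `areas_touched`, `how_to_verify`) are my assumptions about what a useful record contains, not the template that team used.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    """Hypothetical shape for a merged-PR description."""
    title: str
    what_changed: str
    why: str
    areas_touched: list[str] = field(default_factory=list)
    how_to_verify: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, skipping empty optional sections."""
        parts = [self.title, "", "What changed", self.what_changed, "", "Why", self.why]
        if self.areas_touched:
            parts += ["", "Areas touched"] + [f"- {area}" for area in self.areas_touched]
        if self.how_to_verify:
            parts += ["", "How to verify"] + [f"- {step}" for step in self.how_to_verify]
        return "\n".join(parts)
```

Keeping "why" as a required field is the design choice that matters here: it is the part a diff cannot tell you six weeks later.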

The QA agent: pass, fail, and a head start on the fix

The part that changed the team's speed the most was on the QA side.

When a ticket's change merged and QA started testing, whether manual, smoke, regression, or end-to-end, an agent evaluated the result against the ticket's acceptance criteria and made a call: pass or fail. A clear, criteria-based verdict, not a vibe.
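A criteria-based verdict can be sketched as a small function over per-criterion results. This is an illustration under stated assumptions, not the agent's actual logic: the `CriterionResult` type is hypothetical, and the rules (every criterion must be met, and an empty list of criteria counts as a fail) are my own choices.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class CriterionResult:
    """Outcome of checking one acceptance criterion (hypothetical structure)."""
    criterion: str
    met: bool
    evidence: str = ""


def decide(results: list[CriterionResult]) -> tuple[Verdict, list[str]]:
    """Return the verdict plus the list of reasons for a fail.

    Assumed rule: pass only when every criterion is met. With nothing
    evaluated there is no basis for a pass, so that case fails too.
    """
    if not results:
        return Verdict.FAIL, ["no acceptance criteria were evaluated"]
    unmet = [r.criterion for r in results if not r.met]
    return (Verdict.FAIL if unmet else Verdict.PASS), unmet
```

Returning the unmet criteria alongside the verdict means a fail always arrives with the specific criteria it failed on.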

The valuable part was what happened on a fail. Instead of just "this does not work," the agent pointed at likely causes: which files were probably involved, what change might have introduced the problem, what actions could fix it. Crucially, it could sometimes catch that a change had interfered with a different piece of code nobody had considered when the work was planned, the kind of cross-cutting break that is brutal to track down by hand.
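The article does not say how the agent found cross-cutting breaks, so the sketch below swaps in a deliberately simple stand-in: a breadth-first walk over a reverse-dependency map, starting from the files the change touched. The `imported_by` map and the file paths are hypothetical; in a real project you would have to build that map from your own import graph. What it shows is how "code nobody had considered" can be listed mechanically as candidates, ranked by distance from the change.

```python
from collections import deque


def blast_radius(changed: set[str], imported_by: dict[str, set[str]]) -> dict[str, int]:
    """Map each possibly affected file to its distance from the change.

    Distance 0 means the file was changed directly, 1 means it imports a
    changed file, and so on. `imported_by[path]` is the set of files that
    import `path` (a hypothetical, precomputed reverse-dependency map).
    """
    distance = {path: 0 for path in changed}
    queue = deque(sorted(changed))
    while queue:
        current = queue.popleft()
        for dependent in sorted(imported_by.get(current, ())):
            if dependent not in distance:  # first visit is the shortest path
                distance[dependent] = distance[current] + 1
                queue.append(dependent)
    return distance
```

A list like this is only a set of suspects. It narrows where to look, and the developer still decides which suspect, if any, is the actual cause.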

That changed the shape of a QA kickback. Normally a kickback hands the developer a problem and a blank page: figure out what went wrong, then figure out how to fix it. With the agent, the developer got a starting set of possibilities. Sometimes they were spot-on. Sometimes they were wrong, and that is fine, because a good developer reads a wrong suggestion and immediately recognizes it as wrong, which still narrows the search. Either way, they started from somewhere instead of from nothing, and issues resolved noticeably faster.

Why this is worth setting up

You could read all of this as automation for its own sake. It is not. Every piece of it targets a specific, expensive failure mode of how teams actually work:

Humans skip checks under deadline, so the PR agent makes the skipped checks automatic.
Context gets lost over time, so every change documents itself.
QA kickbacks hand developers a blank page, so the QA agent hands them a head start.

None of it removes human judgment. The developer still owns the code, still reviews it, still decides whether a suggested fix is right. What the agents remove is the part of the job that humans are reliably bad at, not because they lack skill, but because attention is finite and deadlines are real.

The result on that team was simple: fewer of the same kinds of bugs reaching main, faster turnaround when something did fail, and a codebase that explained its own history. That is what AI in delivery actually looks like when it is built around how a team really works, rather than bolted on as a demo. It is not the AI writing the app. It is the AI making sure the humans writing the app catch what they would otherwise miss.

How to Catch Bugs Before Merge: An Agentic Pull-Request Review


Chris Martinez