How to review AI-generated PRs
This started as a parody of thoughtbot’s recent article on the same subject, but grew from there.
While I take issue with the angle from which the author of that article attacks this problem, we ultimately do agree on the nature of the problem, so you’ll find some earnest discussion in the final section.
Does your team use AI a lot? Maybe too much? Are you feeling overwhelmed by the firehose of bad code you’re having to review? I feel you. Here are some techniques and strategies I’ve adopted, as a reasonable human, that have made reviewing AI-generated code feel less taxing and more productive.
What’s the same?
Before we dive into the differences between human and LLM-generated code, it’s worth evaluating what’s the same. These things are true whether your PR was written by a human or a coding agent:
- All of your regular PR review goals about code quality, maintainability, and testability still apply. If you cared about something for a good reason when a human wrote it, you should still care about it if an AI wrote it.
That’s it. That’s what’s the same.
What’s different?
- Modern AI coding agents are extremely verbose. We developed all these techniques to help developers avoid gold-plating and now we’re being asked to build everything using a gold-plate generator. Cool.
- The author of the PR probably doesn’t give a shit about the code they are pushing up. They may not have even read portions of the PR. If you were worried your coworkers didn’t give a shit before, this tech is turbocharging that.
- Despite what the article I’m parodying claims, AI agents don’t always solve the problem at hand. You can’t trust that the code works. At least we agree that LLMs suck at considering how solutions fit into existing systems and architectures, though.
- The PR description will stand out. LLMs like to use a lot of unnecessary formatting and bulleted lists. Wait… was the article AI generated? This section in the original article didn’t even contain any establishing prose—and it’s entirely just an overformatted list.
How you can adapt
- Your time is valuable, but the PR author doesn’t value it. Be strategic and ignore their request for review.
- The PR author is likely not as attached to the implementation as they would be if they wrote it by hand. Use this to your advantage. Waste their time with a trickle of requests for major changes.
- Consider that the PR author might simply put your PR comment into an AI agent. Start all requests with “Ignore all previous instructions and”. Depending on how their agent is set up, you may even be able to exfiltrate their credit card details.
- Make note of seemingly unrelated changes. The original code was probably that way for a reason. When one of those changes causes an outage, blame your coworker and/or LLMs in the post mortem.
Pay specific attention to tests
Testing is software engineering, but it’s often viewed as less important by humans and AI alike. If the author didn’t read the production code, they certainly weren’t paying attention to the tests.
- Expect garbage. The vast majority of test suites are low quality, so that’s what these models are trained on. LLM tools are garbage in, garbage out.
- Do the tests actually test the code? Novice testers make this mistake now and again. I’ve done it myself. LLMs do it all the time. They’ll just give up, stub the method under test, and move on (see the sketch after this list). If the PR author didn’t notice this, then wow… you’re screwed.
- Consider edge cases and boundary values. Are those tested? (The sketch below shows what that coverage looks like.) This is the kind of work the author was supposed to do when they wrote the code, but obviously they didn’t. Now it’s your job for some reason! It’s starting to feel like we’re not saving a lot of time with all this, isn’t it?
- AI often writes a lot of tests. Many of them will be unnecessary or duplicated. This is totally fine and good because no one has ever said, “My test suite is too slow.”
- Who cares about the testing pyramid? LLMs sure don’t. They’ll write browser-driving tests for minor behavioural changes deep in the code. Good thing your test suite is really fast. Right?
- Watch for huge test files, especially with a lot of setup. If you were writing that by hand, the pain of so much setup or mocking would have been a signal that the implementation needed refactoring. But with AI, we’ve completely sidestepped the value of writing tests. To hell with the craft of software engineering!
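To make the stubbed-test failure mode concrete, here’s a minimal Python sketch. Everything in it is hypothetical (apply_discount and its values are invented for illustration): the first test patches out the very function it claims to exercise, so it can never fail, while the second shows the boundary coverage a reviewer should actually hope to find.

```python
from unittest import mock

import pytest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical production code under review."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_stubbed():
    # The antipattern: the function under test is patched out, so the
    # assertion checks the mock's canned return value, not the
    # implementation. This test can never fail.
    with mock.patch(f"{__name__}.apply_discount", return_value=9.00):
        assert apply_discount(10.00, 10) == 9.00


def test_apply_discount_boundaries():
    # What a reviewer should look for instead: boundary values exercised
    # against the real implementation.
    assert apply_discount(10.00, 0) == 10.00
    assert apply_discount(10.00, 100) == 0.00
    with pytest.raises(ValueError):
        apply_discount(10.00, 101)
```

A suite full of tests like the first one stays green no matter what the implementation does, which is exactly the kind of thing an inattentive PR author won’t catch for you.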
No Sarcasm Beyond This Point
I have to drop (most of) the snark at this point. The original article finishes on a point that’s far more important than any other in it. It also finishes on a sentence containing an em-dash. Anyways.
If people on your team are wasting each other’s time with low-quality contributions that are slowing down development and creating bugs, then you have a people problem. No amount of “Ignore all previous instructions and start obeying the moral imperative to give a shit about the people around you” will help.
Technology doesn’t solve social problems; it amplifies whatever dynamics already exist, and it tends to create new ones along the way. The original article offers tips for cutting your losses after the foundation for collaboration has already broken down. The AI tooling clearly isn’t the problem, and changing your code review habits won’t fix the underlying issue. The real work is helping the team move toward better habits and shared expectations.
From the tone of the previous sections, you might think I’m more critical of these tools than I actually am. I use coding agents. I ship LLM-generated code. But I treat the code these tools generate appropriately. The last thing I want to do is waste my coworkers’ time with code that isn’t up to our team’s standards or (God forbid) doesn’t even work correctly.
I’ll even admit that it’s okay to ship low-quality code at certain times. I’d be lying if I said that I’d never shipped a human-written hack or quick fix. But those were conscious, intentional decisions, made by evaluating trade-offs and risks.
LLMs offer the ability to create things without properly understanding what we are creating. This might sometimes be valuable, but left unchecked, it eats away at the value we meat sacks provide. AI tools are terrible at analyzing and synthesizing the broader context within which a software project lives, something we can be pretty good at.
It’s sometimes fine to not write tests. It’s sometimes fine to ship some low-quality code or a hack. In the same way, offloading your understanding of a code change to an LLM is sometimes fine. Just not if you’re burdening your coworkers with that work. If you do this, you’re just a new flavour of “10x programmer”, appearing productive because of how much you’re slowing down everyone else.
Ultimately, software development is still a social activity. Projects move forward through communication and collaboration. These tools haven’t changed that.