Measuring the productivity impact of AI coding tools used to be a simpler problem. When I first wrote this article at the start of 2025, leaders mostly wanted to know one thing: are these tools worth adopting at all? And we were talking about acceptance rates, usage dashboards, metrics pre- and post-adoption — all reasonable topics for the tools that existed then.
The tools have changed since, a lot. We’re now well into the era of cloud agents opening pull requests and engineers managing streams of autonomous work across multiple branches, with whole organizations redesigning their delivery around AI. But the measurement question hasn’t gone away. It got harder, and much more important.
There’s also a new pressure on top of it. Agents bill by usage, and now that tokenmaxxing is having a moment, the spend climbs fast — right as the people you report to start asking what all this AI is buying. “Show me the money” has become a board-level question. You need a way to see what you’re spending and connect it to the value you’re getting.
This post is my attempt to make that easier. Specifically:
Before we get to measuring, here’s what’s changed and what hasn’t.
The most visible change is the volume of code, and the size of the pull requests carrying it. Across a sample of 1,450+ engineering organizations in Swarmia, median batch size — the lines changed per PR — roughly doubled between Q1 2025 and Q1 2026. For organizations with at least 1,000 PRs in each period, the median grew 97.5%; include smaller teams and it climbs to 109%. Most of that happened in the last six months, with batch size growing about 2.5x faster from October onward as agent adoption hit the mainstream. There’s simply more code moving through review queues, and it’s moving faster.
Last year’s DORA research found the same pattern: high AI adoption correlates with larger pull requests and longer code review times. Engineers produce code faster than ever, and code review hasn’t quite kept up.
Nobody uses just one AI tool. Most engineers run several at once: an agent in their terminal, a different one in their IDE, a chat interface in the browser, a cloud agent bot in Slack. Each operates at a different level of autonomy — inline autocomplete, conversational pair programming, fully autonomous pull requests. Someone might spend 30% of their day in conversational mode and 70% reviewing agent-generated PRs. Vendor dashboards capture a fraction of that, if any of it.
Engineering metrics don’t exist in isolation. If agents help teams write code faster, they’ll generate more pull requests. Larger pull requests at higher volumes slow down review. Slower review increases cycle time. The same tools that look like a productivity gain in one dimension can look like a loss in another. That’s the point: measurement has to be systemic. Any single metric, read in isolation, will mislead you. None of this is new, of course; it has always been this way.
The long-term risk hasn’t lessened as we’ve moved from AI-assisted code to agentic workflows. Engineers are still shipping code they don’t fully understand, which erodes their ability to reason about the system, fix it when it breaks, and bring new people up to speed.
None of this appears in throughput metrics. It surfaces later in incident rates, long debugging sessions, and turnover on the gnarliest parts of the codebase. Quality indicators and developer experience signals are how you catch it before it becomes expensive.
Early agent adopters are often already your highest performers — the engineers already working in the new way. They’re self-motivated, comfortable with new tools, and good at figuring out where AI helps and where it doesn’t. Measurements from this group won’t predict what happens when you roll out to everyone else.
Getting engineers to code 3x faster delivers about 2-5% more organizational output if everything else stays the same, that is, if the rest of the organization doesn’t adapt to the new way of working. The organizations seeing step-change improvements are redesigning the whole flow: how work is scoped, how work is delegated to agents, how reviews happen, how maintenance is handled.
Speeding up coding is a local optimization, and local optimizations don’t move the whole. The 3x happens at the keyboard, then leaks away in planning, review, and the maintenance all that new code generates.
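One way to see why a 3x speedup at the keyboard turns into only a few percent overall is Amdahl's law: if coding takes up a fraction p of end-to-end delivery time and becomes s times faster, total throughput improves by a factor of 1 / ((1 - p) + p / s). The sketch below is only an illustration. The coding shares of 3%, 5%, and 7% are assumptions picked to show how gains of roughly 2% to 5% can arise, not measured figures.

```python
def overall_gain(coding_share: float, coding_speedup: float) -> float:
    """Relative gain in end-to-end throughput (0.03 means 3%), per Amdahl's law."""
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time for the same work after the speedup, with the old total normalized to 1.
    new_time = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 / new_time - 1.0


# Hypothetical shares of end-to-end delivery time spent writing code.
for share in (0.03, 0.05, 0.07):
    gain = overall_gain(share, coding_speedup=3.0)
    print(f"coding share {share:.0%}: overall gain {gain:.1%}")
```

The larger the share of delivery time that sits outside coding, the less any coding speedup matters, which is why the bigger gains come from changing the rest of the flow.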
Okay, so that’s the picture. There’s no one metric, and things are still kind of fuzzy. Still, there’s a lot you can measure, if you take enough dimensions into account, and use the metrics to start conversations. Here are some of those dimensions, plus a few notes on what’s new with agents.
The most important new habit: track cycle time by stage. AI-generated code shifts the distribution: less time writing, more time in review. If review time is growing in proportion to PR volume, a new bottleneck is forming. If total cycle time is compressing across stages, something meaningful is happening.
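As a rough sketch of what tracking by stage can look like, the snippet below splits each pull request's cycle time into stages and takes the median per stage. The stage boundaries and the timestamp field names are assumptions made for illustration; map them to whatever your own tooling exports.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical timestamp fields, not tied to any specific tool's schema.
    first_commit_at: datetime
    opened_at: datetime
    first_review_at: datetime
    approved_at: datetime
    merged_at: datetime


def stage_hours(pr: PullRequest) -> dict[str, float]:
    """Split one PR's cycle time into stages, in hours."""
    def hours(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 3600

    return {
        "coding": hours(pr.first_commit_at, pr.opened_at),
        "waiting_for_review": hours(pr.opened_at, pr.first_review_at),
        "in_review": hours(pr.first_review_at, pr.approved_at),
        "merge": hours(pr.approved_at, pr.merged_at),
    }


def median_stage_hours(prs: list[PullRequest]) -> dict[str, float]:
    """Median hours per stage across a non-empty list of PRs."""
    if not prs:
        raise ValueError("need at least one pull request")
    per_pr = [stage_hours(pr) for pr in prs]
    return {stage: median(p[stage] for p in per_pr) for stage in per_pr[0]}
```

Running `median_stage_hours` on the PRs from two different periods and comparing the results stage by stage shows whether time saved in coding is reappearing as time spent waiting for or sitting in review.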
Rework is getting a lot of attention right now, for a reason that’s specific to AI: the cost of writing code has dropped fast, while the cost of maintaining it hasn’t. When you can generate a feature in an afternoon, the bottleneck moves downstream, into the bugs, the rewrites, and the keeping-the-lights-on (KTLO) work that accumulates behind everything you ship.
That makes rework rate a more useful metric than it was a few years ago. Some organizations we’ve talked to now track rework rate over a six-month window, to see how much of their faster shipping they’re paying for later.
If you already track investment balance, you have a proxy for this. KTLO is rework by another name — the fixes and follow-ups left behind by code that didn’t hold up. When KTLO grows faster than feature work, you’re generating code faster than you’re generating code worth keeping.
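A minimal sketch of that comparison, assuming you can export effort per investment category for two periods. The category keys, the unit (engineer-weeks here), and all of the numbers are made up for illustration.

```python
def growth(before: float, after: float) -> float:
    """Relative change between two periods (0.25 means +25%)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


def ktlo_outpacing_features(before: dict[str, float], after: dict[str, float]) -> bool:
    """True when KTLO effort grew faster than feature effort."""
    return growth(before["ktlo"], after["ktlo"]) > growth(before["feature"], after["feature"])


# Made-up effort per six-month period, in engineer-weeks.
previous_half = {"feature": 120.0, "ktlo": 40.0}
current_half = {"feature": 150.0, "ktlo": 60.0}

print(f"feature growth: {growth(previous_half['feature'], current_half['feature']):.0%}")
print(f"KTLO growth: {growth(previous_half['ktlo'], current_half['ktlo']):.0%}")
print("KTLO outpacing features:", ktlo_outpacing_features(previous_half, current_half))
```

In this made-up example feature work grows 25% while KTLO grows 50%, which is the pattern worth a closer look.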
Once you have meaningful agent activity, you can start tracking at the agent pull request level:
Swarmia’s agent metrics view surfaces these out of the box, and when you want to slice the data your own way, you can build custom views or just ask Swarmia AI for what you need in plain language.
One of the key risks of agent-generated code is knowledge concentration — where individual developers generate and ship code without proper team understanding. You can counter this by measuring:
For example, you might adopt a team working agreement to make sure that multiple team members collaborate on each significant feature. This helps maintain code quality while spreading knowledge organically through the team.
Speed gains mean nothing if quality is eroding at the same time, so keep an eye on:
There’s growing evidence behind the concern. CodeRabbit’s analysis of 470 open-source pull requests found roughly 1.7 times more issues in AI-coauthored pull requests than in human-written ones. And a Carnegie Mellon study of 807 GitHub repositories found that Cursor adoption raised cognitive complexity by about 41% and static-analysis warnings by about 30% — with the complexity sticking around even as teams grew more familiar with the tools.
Quantitative metrics alone will mislead you. Run developer surveys to gather qualitative signal:
That last question is worth asking directly, because engineers will tell you long before the bug backlog does. Consider looking at overall sentiment and individual experiences separately — the aggregate can mask important variation.
As adoption scales, spend becomes a number worth watching on its own. AI pricing is moving toward usage-based models, and those bills are far less predictable than a flat per-seat license — usage-based costs can climb fast in a single quarter. At the same time, the pressure to justify the spend is going up: someone above you wants to know what the AI budget is returning.
You won’t get a clean ROI multiple out of this, and you should be suspicious of anyone selling you one. But you can track what you’re spending per engineer and watch it against the impact you’re seeing. Rising cost is fine when impact is rising with it; the signal to act on is spend climbing while impact stays flat.
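Here is a hedged sketch of that check: it flags quarters where spend per engineer rose noticeably while an impact measure stayed roughly flat. The `impact` field stands in for whatever measure you trust, and the 10% and 2% thresholds are arbitrary assumptions to tune, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Quarter:
    label: str
    ai_spend: float   # total AI tooling spend for the quarter
    engineers: int    # engineers using the tools
    impact: float     # hypothetical impact measure of your choosing


def spend_per_engineer(q: Quarter) -> float:
    return q.ai_spend / q.engineers


def flag_spend_without_impact(quarters, spend_rise=0.10, impact_tolerance=0.02):
    """Labels of quarters where spend per engineer rose by more than `spend_rise`
    while impact grew by no more than `impact_tolerance`, versus the prior quarter."""
    flagged = []
    for prev, cur in zip(quarters, quarters[1:]):
        spend_change = spend_per_engineer(cur) / spend_per_engineer(prev) - 1
        impact_change = cur.impact / prev.impact - 1
        if spend_change > spend_rise and impact_change <= impact_tolerance:
            flagged.append(cur.label)
    return flagged


# Made-up numbers: spend keeps climbing, impact stalls in Q3.
history = [
    Quarter("Q1", ai_spend=20_000, engineers=100, impact=10.0),
    Quarter("Q2", ai_spend=30_000, engineers=100, impact=12.0),
    Quarter("Q3", ai_spend=45_000, engineers=100, impact=12.1),
]
print(flag_spend_without_impact(history))
```

With these numbers only Q3 is flagged: spend rose in Q2 as well, but impact rose with it. A flag is a prompt to go and look at what changed, not a verdict.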
The catch is that the data is scattered. Each vendor has its own usage dashboard, they don’t talk to each other, and none of them connect spend to what shipped (ask us how we know).
So we built that view ourselves: adoption, usage, cost, and impact in one place, so you’re not stitching invoices and screenshots together by hand to answer a question from your CFO.
Knowing what to measure is one thing; getting value out of these tools is another. Here’s what tends to help teams get there.
Teams need dedicated time to experiment with and master these tools — particularly as agentic modes require different habits than autocomplete. Consider:
We’ve been on this for years. Before ChatGPT was available for businesses, our team built a Slack bot on top of the OpenAI API and ran it in a public channel, so everyone could learn from each other’s prompts and use cases. We ran our first internal AI Festival back in 2023 for the same reason.
More recently, the habit has been integrated into how we plan and build features: in sessions we call campfires, a small group from product, design, and engineering spends a few hours building (and sometimes shipping) the first iteration of a feature together, live. The point is to build the habit of sharing what works; the usage and impact follow from that.
Agents only work well if the underlying delivery system supports them. As DORA’s research makes clear, AI amplifies what you already have — organizations with strong engineering practices see benefits, and those without see their bottlenecks made more visible.
I’ve argued elsewhere that engineers increasingly own the whole system that delivers a product — the CI, the tests, the agent context, the review automation that keep it running. Measurement can help tell you whether that system is healthy.
Some practical priorities:
Not all tasks benefit equally from autonomous agents. The highest-leverage candidates tend to be well-specified, isolated, and easy to verify: dependency updates, flaky test fixes, bug reports with a clear reproduction, documentation updates triggered by merged pull requests.
If you’re in the middle of a large migration project, agents can do a lot of the repetitive implementation work — the cognitive overhead of maintaining two frameworks simultaneously is significant, and being able to delegate the mechanical parts can be transformative.
Start with the tasks where “done” is easy to define and easy to verify. Build the muscle there before moving to more complex work.
Allow teams to find their own path to effective agent use. Some engineers and task types will benefit more than others — that’s expected. The goal is to find the patterns that work and make them easy to replicate.
A team’s experience can also flip quickly. Something that didn’t work six months ago might be a clear win today, because the underlying capabilities are improving faster than anything we’ve ever seen in software tooling. If a team had a rough experience with agents and stepped back, it’s probably worth checking again after a few months.
Without visibility, teams that aren’t using agents may assume nobody is. Share success stories and learnings across teams. Track and communicate adoption in ways that emphasize learning rather than comparison. Celebrate the uses of AI that improve how the organization works, beyond the impressive demos.
The concerns have changed as tools have improved. Two years ago, the worry was whether AI-generated code would be any good at all. Today, the code is pretty good, and the more common concerns are:
A good measurement setup gives you a clear picture of your delivery system: where the gains are, where new bottlenecks are forming, whether rework is piling up faster than you can pay it down, and whether quality is holding up as volume grows. No single metric does that. Look at only one and you’ll misread the rest.
So start with your baselines. Instrument your delivery system before you try to measure AI impact — you can’t know whether anything changed without knowing what normal looked like. Then buy the tools, build adoption, introduce more complex and involved use cases, and watch what happens.
AI tools will keep improving, and I’ll probably have to update this article again before the year is out. But the fundamentals of engineering effectiveness will not change; that much I’m sure about.
