Measuring the productivity impact of AI coding tools: A practical guide for engineering leaders

Measuring the productivity impact of AI coding tools used to be a simpler problem. When I first wrote this article at the start of 2025, leaders mostly wanted to know one thing: are these tools worth adopting at all? And we were talking about acceptance rates, usage dashboards, metrics pre- and post-adoption — all reasonable topics for the tools that existed then.

The tools have changed since, a lot. We’re now well into the era of cloud agents opening pull requests and engineers managing streams of autonomous work across multiple branches, with whole organizations redesigning their delivery around AI. But the measurement question hasn’t gone away. It got harder, and much more important.

There’s also a new pressure on top of it. Agents bill by usage, and now that tokenmaxxing is having a moment, the spend climbs fast — right as the people you report to start asking what all this AI is buying. “Show me the money” has become a board-level question. You need a way to see what you’re spending and connect it to the value you’re getting.

This post is my attempt to make that easier. Specifically:

Why measuring got more complex with agents
What to measure to capture both immediate gains and system-level effects
How to track spend and connect it to value
How to make it work in practice

Where things stand

Before we get to measuring, here’s what’s changed and what hasn’t.

Batch sizes have doubled

The most visible change is the volume of code, and the size of the pull requests carrying it. Across a sample of 1,450+ engineering organizations in Swarmia, median batch size — the lines changed per PR — roughly doubled between Q1 2025 and Q1 2026. For organizations with at least 1,000 PRs in each period, the median grew 97.5%; include smaller teams and it climbs to 109%. Most of that happened in the last six months, with batch size growing about 2.5x faster from October onward as agent adoption hit the mainstream. There’s simply more code moving through review queues, and it’s moving faster.

Last year’s DORA research found the same pattern: high AI adoption correlates with larger pull requests and longer code review times. Engineers produce code faster than ever, and code review hasn’t quite kept up.

Engineers still use several tools at once

Nobody uses just one AI tool. Most engineers run several at once: an agent in their terminal, a different one in their IDE, a chat interface in the browser, a cloud agent bot in Slack. Each operates at a different level of autonomy — inline autocomplete, conversational pair programming, fully autonomous pull requests. Someone might spend 30% of their day in conversational mode and 70% reviewing agent-generated PRs. Vendor dashboards capture a fraction of that, if any of it.

Metrics still interact, but it’s more pronounced now

Engineering metrics don’t exist in isolation. If agents help teams write code faster, they’ll generate more pull requests. Larger pull requests at higher volumes slow down review. Slower review increases cycle time. The same tools that look like a productivity gain in one dimension can look like a loss in another. That’s the point: measurement has to be systemic. Any single metric, read in isolation, will mislead you. Of course, this isn’t new knowledge; it’s always been this way.

Code ownership and second-order effects

The long-term risk hasn’t lessened as we’ve moved from AI-assisted code to agentic workflows. Engineers are still shipping code they don’t fully understand, which erodes their ability to reason about the system, fix it when it breaks, and bring new people up to speed.

None of this appears in throughput metrics. It surfaces later in incident rates, long debugging sessions, and turnover on the gnarliest parts of the codebase. Quality indicators and developer experience signals are how you catch it before it becomes expensive.

Self-selection bias

Early agent adopters are often already your highest performers — the engineers already working in the new way. They’re self-motivated, comfortable with new tools, and good at figuring out where AI helps and where it doesn’t. Measurements from this group won’t predict what happens when you roll out to everyone else.

Coding 3x faster only translates into 2-5% on the org level

Getting engineers to code 3x faster delivers only about 2-5% more organizational output if everything else stays the same, that is, if the rest of the organization doesn’t adapt to the new way of working. The organizations seeing step-change improvements are redesigning the whole flow: how work is scoped, how agents are delegated to, how reviews happen, how maintenance is handled.

Speeding up coding is a local optimization, and local optimizations don’t move the whole. The 3x happens at the keyboard, then leaks away in planning, review, and the maintenance all that new code generates.
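
One way to build intuition for why the gain shrinks is Amdahl’s law. The sketch below is an illustration, not the source of the 2-5% figure: it assumes hands-on coding is only a fraction of end-to-end delivery time and that every other stage stays unchanged. The coding shares in the loop are hypothetical inputs, not measured data.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the coding share gets faster.

    coding_share is the fraction of total delivery time spent writing code
    (a hypothetical input); all other stages are assumed unchanged.
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    return 1.0 / ((1.0 - coding_share) + coding_share / coding_speedup)


# Hypothetical shares of end-to-end time spent on hands-on coding.
for share in (0.03, 0.07, 0.30):
    gain = (overall_speedup(share, 3.0) - 1.0) * 100
    print(f"coding share {share:.0%}: {gain:.1f}% more output")
```

Under these assumptions, a 3x coding speedup yields roughly 2% to 5% more output when coding is 3% to 7% of delivery time, and 25% when it is 30%. The gain is capped by everything that did not get faster.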

A practical approach to measuring AI productivity impact

Okay, so that’s the picture. There’s no one metric, and things are still kind of fuzzy. Still, there’s a lot you can measure, if you take enough dimensions into account, and use the metrics to start conversations. Here are some of those dimensions, plus a few notes on what’s new with agents.

System throughput and flow

The most important new habit: track cycle time by stage. AI-generated code shifts the distribution: less time writing, more time in review. If review time is growing in proportion to PR volume, a new bottleneck is forming. If total cycle time is compressing across stages, something meaningful is happening.

Cycle time: Track the full lifecycle from first commit to production, broken down by stage. Pay particular attention to coding time vs. review time.
Throughput: Monitor completed work items that reach production; counting merged PRs overstates how much is truly done.
Investment balance: Track the ratio of new feature work to maintenance work. If keeping-the-lights-on work is growing faster than throughput, the system is generating code faster than it’s generating quality.
Batch size: Bigger PRs are partly the nature of agentic work, and a larger batch can mean an agent handled a more complex task. But smaller, focused changes still review faster and are easier to verify. Watch the trend, and keep batches small where the work allows.
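
As a minimal sketch of the stage breakdown described above, the following splits each pull request’s cycle time into coding, review, and deploy stages and takes the median per stage. The `PullRequest` record and its four timestamp fields are hypothetical; you would map them from whatever your Git host and deployment pipeline expose.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical record shape; map these from your own Git/CI data.
    first_commit_at: datetime
    review_requested_at: datetime
    approved_at: datetime
    deployed_at: datetime


def stage_hours(pr: PullRequest) -> dict[str, float]:
    """Split one PR's cycle time into coding, review, and deploy stages (hours)."""
    def hours(start: datetime, end: datetime) -> float:
        return (end - start).total_seconds() / 3600

    return {
        "coding": hours(pr.first_commit_at, pr.review_requested_at),
        "review": hours(pr.review_requested_at, pr.approved_at),
        "deploy": hours(pr.approved_at, pr.deployed_at),
    }


def median_stage_hours(prs: list[PullRequest]) -> dict[str, float]:
    """Median hours per stage across a non-empty list of PRs."""
    per_pr = [stage_hours(pr) for pr in prs]
    return {s: median(p[s] for p in per_pr) for s in ("coding", "review", "deploy")}


# Made-up sample data for illustration only.
prs = [
    PullRequest(datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 13),
                datetime(2026, 1, 6, 13), datetime(2026, 1, 6, 15)),
    PullRequest(datetime(2026, 1, 7, 9), datetime(2026, 1, 7, 11),
                datetime(2026, 1, 9, 11), datetime(2026, 1, 9, 12)),
    PullRequest(datetime(2026, 1, 12, 9), datetime(2026, 1, 12, 15),
                datetime(2026, 1, 13, 3), datetime(2026, 1, 13, 6)),
]
print(median_stage_hours(prs))
```

Comparing these medians period over period shows whether time saved in coding is reappearing in review.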

Rework and the cost of maintenance

Rework is getting a lot of attention right now, for a reason that’s specific to AI: the cost of writing code has dropped fast, while the cost of maintaining it hasn’t. When you can generate a feature in an afternoon, the bottleneck moves downstream, into the bugs, the rewrites, and the keeping-the-lights-on (KTLO) work that accumulates behind everything you ship.

That makes rework a more useful metric than it was a few years ago. Some organizations we’ve talked to now track rework rate over a six-month window, to see how much of their faster shipping they’re paying for later.

If you already track investment balance, you have a proxy for this. KTLO is rework by another name — the fixes and follow-ups left behind by code that didn’t hold up. When KTLO grows faster than feature work, you’re generating code faster than you’re generating code worth keeping.
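
The comparison in the last sentence can be sketched as a simple growth check between two periods. This assumes you already categorize effort as feature or KTLO work; the dictionary keys and the sample numbers (engineer-days per period) are hypothetical.

```python
def growth(previous: float, current: float) -> float:
    """Relative growth between two periods, e.g. 0.25 for +25%."""
    if previous <= 0:
        raise ValueError("previous period must be positive to compute growth")
    return (current - previous) / previous


def ktlo_outpacing_features(prev: dict[str, float], curr: dict[str, float]) -> bool:
    """True when KTLO effort grew faster than feature effort between periods."""
    return growth(prev["ktlo"], curr["ktlo"]) > growth(prev["feature"], curr["feature"])


# Made-up engineer-days per category for two consecutive six-month windows.
previous_window = {"feature": 120.0, "ktlo": 40.0}
current_window = {"feature": 150.0, "ktlo": 70.0}

print(ktlo_outpacing_features(previous_window, current_window))
```

Here feature work grew 25% while KTLO grew 75%, so the check returns `True`. Treat a result like that as a prompt for a conversation rather than a verdict.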

Agent-specific signals

Once you have meaningful agent activity, you can start tracking at the agent pull request level:

Agent merge rate: What share of agent-created PRs get merged? Low rates signal tasks are too complex or too poorly scoped for autonomous completion — both are worth knowing.
Share of agent pull requests: What percentage of total output is agent-created, and is it growing?
Review time on agent pull requests: Agents tend to produce larger, less familiar diffs. If reviewing agent pull requests takes significantly longer than reviewing human pull requests, that’s worth investigating.
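
If you want to compute these three signals from raw pull request data, a rough sketch might look like the following. The `PR` record, the "agent"/"human" labels, and the sample values are all hypothetical; how you identify agent-authored PRs depends on your tooling.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PR:
    author_type: str     # "agent" or "human" (hypothetical labels)
    merged: bool
    review_hours: float  # time from review request to approval or close


def agent_signals(prs: list[PR]) -> dict[str, float]:
    """Agent merge rate, agent share of PRs, and agent/human review time ratio."""
    agent = [p for p in prs if p.author_type == "agent"]
    human = [p for p in prs if p.author_type == "human"]
    if not agent or not human:
        raise ValueError("need at least one agent PR and one human PR")
    return {
        "agent_merge_rate": sum(p.merged for p in agent) / len(agent),
        "agent_share": len(agent) / len(prs),
        "review_time_ratio": median(p.review_hours for p in agent)
        / median(p.review_hours for p in human),
    }


# Made-up sample: 4 agent PRs and 6 human PRs.
prs = [
    PR("agent", True, 6), PR("agent", True, 10),
    PR("agent", False, 8), PR("agent", True, 12),
    PR("human", True, 4), PR("human", True, 6), PR("human", True, 3),
    PR("human", True, 5), PR("human", False, 4), PR("human", True, 8),
]
print(agent_signals(prs))
```

In this sample, 75% of agent PRs merged, agents opened 40% of all PRs, and the median agent PR took twice as long to review as the median human PR.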

Swarmia’s agent metrics view surfaces these out of the box, and when you want to slice the data your own way, you can build custom views or just ask Swarmia AI for what you need in plain language.

Team collaboration and knowledge sharing

One of the key risks of agent-generated code is knowledge concentration, where individual developers generate and ship code without proper team understanding. You can counter this by measuring:

Pull request collaboration patterns: how many team members are involved in each feature or epic?
Review depth: is review thoroughness holding up as code volume grows?
Knowledge distribution: is work spreading across the team, or concentrating in a few people?

For example, you might adopt a team working agreement to make sure that multiple team members collaborate on each significant feature. This helps maintain code quality while spreading knowledge organically through the team.
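
One rough way to approximate the knowledge distribution question is to look at how much of the change activity in each area comes from its most active contributors. This is a sketch under simple assumptions: the area names and author lists are hypothetical, and counting changes is only a proxy for who understands the code.

```python
from collections import Counter


def top_contributor_share(authors: list[str], top_n: int = 1) -> float:
    """Share of changes made by the top_n most active authors in an area."""
    if not authors:
        raise ValueError("no changes recorded")
    counts = Counter(authors)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(authors)


# Made-up change authors per code area over some period.
changes_by_area = {
    "billing": ["ana"] * 8 + ["ben"] * 2,
    "search": ["ana", "ben", "chi", "ben", "chi", "ana"],
}

for area, authors in changes_by_area.items():
    print(f"{area}: top contributor made {top_contributor_share(authors):.0%} of changes")
```

A share that keeps rising in one area suggests work is concentrating in a few people, which is the pattern a working agreement like the one above is meant to prevent.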

Code quality indicators

Speed gains mean nothing if quality is eroding at the same time, so keep an eye on:

Bug backlog trends
Production incident rates and time to recovery
Change fail rate
CI reliability — flaky tests and slow pipelines are a leading indicator of code quality problems, and agents stress-test both
Code review feedback patterns: are reviewers catching more issues, or fewer?

There’s growing evidence behind the concern. CodeRabbit’s analysis of 470 open-source pull requests found roughly 1.7 times more issues in AI-coauthored pull requests than in human-written ones. And a Carnegie Mellon study of 807 GitHub repositories found that Cursor adoption raised cognitive complexity by about 41% and static-analysis warnings by about 30% — with the complexity sticking around even as teams grew more familiar with the tools.

Developer experience

Quantitative metrics alone will mislead you. Run developer surveys to gather qualitative signal:

“AI tools help me work through problems faster.”
“I feel confident reviewing AI-generated code on my team.”
“The quality of AI-generated code reaching production concerns me.”

That last statement is worth putting to engineers directly, because they will tell you long before the bug backlog does. Consider looking at overall sentiment and individual experiences separately, since the aggregate can mask important variation.

AI cost and token spend

As adoption scales, spend becomes a number worth watching on its own. AI pricing is moving toward usage-based models, and those bills are far less predictable than a flat per-seat license — usage-based costs can climb fast in a single quarter. At the same time, the pressure to justify the spend is going up: someone above you wants to know what the AI budget is returning.

You won’t get a clean ROI multiple out of this, and you should be suspicious of anyone selling you one. But you can track what you’re spending per engineer and watch it against the impact you’re seeing. Rising cost is fine when impact is rising with it; the signal to act on is spend climbing while impact stays flat.
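
The signal described above, spend climbing while impact stays flat, can be sketched as a comparison of two time series. Everything here is an assumption for illustration: the choice of impact metric (for example, completed work items per engineer), the thresholds, and the sample numbers.

```python
def spend_without_impact(
    spend_per_engineer: list[float],
    impact: list[float],
    spend_threshold: float = 0.10,
    impact_threshold: float = 0.02,
) -> bool:
    """Flag the case where spend climbs while impact stays roughly flat.

    Compares the first and last period of each series. The thresholds are
    hypothetical defaults; tune them to your own data.
    """
    if len(spend_per_engineer) < 2 or len(spend_per_engineer) != len(impact):
        raise ValueError("need two or more periods, same length for both series")
    if spend_per_engineer[0] <= 0 or impact[0] <= 0:
        raise ValueError("first period values must be positive")
    spend_growth = (spend_per_engineer[-1] - spend_per_engineer[0]) / spend_per_engineer[0]
    impact_growth = (impact[-1] - impact[0]) / impact[0]
    return spend_growth > spend_threshold and impact_growth < impact_threshold


# Made-up monthly figures: AI spend per engineer and work items per engineer.
monthly_spend = [80.0, 110.0, 150.0]
flat_impact = [4.0, 4.1, 4.0]
rising_impact = [4.0, 4.6, 5.2]

print(spend_without_impact(monthly_spend, flat_impact))
print(spend_without_impact(monthly_spend, rising_impact))
```

The first call flags a problem because spend nearly doubled while impact did not move; the second does not, because impact rose alongside spend. This is a trend check, not an ROI calculation.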

The catch is that the data is scattered. Each vendor has its own usage dashboard, they don’t talk to each other, and none of them connect spend to what shipped (ask us how we know).

So we built that view ourselves: adoption, usage, cost, and impact in one place, so you’re not stitching invoices and screenshots together by hand to answer a question from your CFO.

How to make it work in practice

Knowing what to measure is one thing; getting value out of these tools is another. Here’s what tends to help teams get there.

Create space for learning

Teams need dedicated time to experiment with and master these tools — particularly as agentic modes require different habits than autocomplete. Consider:

Hackathons or structured exploration time focused on agent workflows
Internal knowledge sharing where engineers demonstrate effective usage patterns
Dedicated channels for sharing what works, what doesn’t, and which tasks benefit most

We’ve been on this for years. Before ChatGPT was available for businesses, our team built a Slack bot on top of the OpenAI API and ran it in a public channel, so everyone could learn from each other’s prompts and use cases. We ran our first internal AI Festival back in 2023 for the same reason.

More recently, the habit has been integrated into how we plan and build features: in sessions we call campfires, a small group from product, design, and engineering spends a few hours building, and sometimes shipping, the first iteration of a feature together, live. The point is to build the habit of sharing what works; the usage and impact follow from that.

Invest in your delivery system

Agents only work well if the underlying delivery system supports them. As DORA’s research makes clear, AI amplifies what you already have — organizations with strong engineering practices see benefits, and those without see their bottlenecks made more visible.

I’ve argued elsewhere that engineers increasingly own the whole system that delivers a product — the CI, the tests, the agent context, the review automation that keep it running. Measurement can help tell you whether that system is healthy.

Some practical priorities:

Comprehensive automated testing that covers more than the happy path — agents produce bugs faster than humans do
Reliable CI/CD that gives fast feedback — a 30-minute flaky test suite is a serious constraint when agents are opening PRs around the clock
Codebase documentation that agents can read — AGENTS.md files, in-repository architecture notes, coding guidelines in plain text. This investment also benefits the humans on your team.

Look for high-leverage agent tasks

Not all tasks benefit equally from autonomous agents. The highest-leverage candidates tend to be well-specified, isolated, and easy to verify: dependency updates, flaky test fixes, bug reports with a clear reproduction, documentation updates triggered by merged pull requests.

If you’re in the middle of a large migration project, agents can do a lot of the repetitive implementation work — the cognitive overhead of maintaining two frameworks simultaneously is significant, and being able to delegate the mechanical parts can be transformative.

Start with the tasks where “done” is easy to define and easy to verify. Build the muscle there before moving to more complex work.

Embrace incremental adoption

Allow teams to find their own path to effective agent use. Some engineers and task types will benefit more than others — that’s expected. The goal is to find the patterns that work and make them easy to replicate.

A team’s experience can also flip quickly. Something that didn’t work six months ago might be a clear win today, because the underlying capabilities are improving faster than anything we’ve ever seen in software tooling. If a team had a rough experience with agents and stepped back, it’s probably worth checking again after a few months.

Make progress visible

Without visibility, teams that aren’t using agents may assume nobody is. Share success stories and learnings across teams. Track and communicate adoption in ways that emphasize learning rather than comparison. Celebrate the uses of AI that improve how the organization works, beyond the impressive demos.

Address the new concerns

The concerns have changed as tools have improved. Two years ago, the worry was whether AI-generated code would be any good at all. Today, the code is pretty good, and the more common concerns are:

Review burden: agents generate PRs faster than teams can review them. Establish norms for agent-created pull requests — what needs human review, what doesn’t, and how to keep review times from compounding.
Code ownership: when agents write code, it’s easy for engineers to feel detached from what they’re shipping. Reinforce that ownership doesn’t change because the code was generated — the person who ships it is responsible for it.
Quality standards: set clear expectations for what needs to happen before agent-generated code gets merged. Handing it to your team for review without reading it yourself isn’t good enough.

Putting it together

A good measurement setup gives you a clear picture of your delivery system: where the gains are, where new bottlenecks are forming, whether rework is piling up faster than you can pay it down, and whether quality is holding up as volume grows. No single metric does that. Look at only one and you’ll misread the rest.

So start with your baselines. Instrument your delivery system before you try to measure AI impact — you can’t know whether anything changed without knowing what normal looked like. Then buy the tools, build adoption, introduce more complex and involved use cases, and watch what happens.

AI tools will keep improving, and I’ll probably have to update this article again before the year is out. But the fundamentals of engineering effectiveness will not change. That much I’m sure about.
