Entelligence Research · May 2026
An analysis of 1M+ pull requests across 2,444 engineering organizations. More AI spend. More code volume. More production failures. We measured where AI engineering effort actually goes, how code review has responded to 2.6× volume growth, and why the reactive work treadmill keeps accelerating. The findings: $0.82 of every AI dollar is consumed before a single feature reaches users.
At the current trajectory, an engineering team spending $100,000/year on AI coding tools generates roughly $18,000 of shipped product value. The remaining $82,000 is consumed by the maintenance cycle those same tools are helping to accelerate. This is not because engineers are inefficient or the AI tools are bad. It is because there is no closed loop between production reality and the code being written.
| Category | Platform avg | P75 | P90 | What it measures |
|---|---|---|---|---|
| Reactive Engineering | $0.44 | $0.62 | $0.76 | Bug fixes + maintenance PRs |
| Code Rework | $0.27 | $0.38 | $0.55 | Code written and discarded within the week |
| Review Friction | $0.11 | — | — | Overhead from review that doesn't catch anything |
| Shipped Product | $0.18 | $0.10 | $0.06 | Net-new value that reaches users |
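The per-dollar split can be applied to any budget. The following is a minimal sketch, assuming the "Platform avg" column is used as-is; the names `PLATFORM_AVG_CENTS` and `allocate_budget` are hypothetical and are not part of any Entelligence tooling.

```python
# Cents of each AI dollar, copied from the "Platform avg" column above.
# The constant and function names are illustrative, not from the report.
PLATFORM_AVG_CENTS = {
    "Reactive Engineering": 44,
    "Code Rework": 27,
    "Review Friction": 11,
    "Shipped Product": 18,
}


def allocate_budget(annual_spend_usd, cents_per_dollar=PLATFORM_AVG_CENTS):
    """Split an annual AI tooling budget across the four categories."""
    if sum(cents_per_dollar.values()) != 100:
        raise ValueError("category shares must sum to one dollar")
    return {
        category: annual_spend_usd * cents / 100
        for category, cents in cents_per_dollar.items()
    }


allocation = allocate_budget(100_000)

# Everything except shipped product is consumed before a feature reaches users.
overhead = sum(
    amount
    for category, amount in allocation.items()
    if category != "Shipped Product"
)

for category, amount in allocation.items():
    print(f"{category:<22} ${amount:>10,.0f}")
print(f"{'Consumed before ship':<22} ${overhead:>10,.0f}")
```

For a $100,000 budget this reproduces the $18,000 shipped and $82,000 consumed figures quoted above. Swapping in another organization's own shares only requires that they still sum to one dollar.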
Nearly half of all engineering output on the platform is classified as reactive: fixing existing code or keeping existing systems running. At the median organization, 44% of PRs are reactive. At the 75th percentile it is 62%. At the 90th percentile, more than three-quarters of all engineering effort goes toward work that produces no net-new product. These organizations burn three-quarters of their capacity on maintenance, leaving at most a quarter for features. More AI spend accelerates the volume on both sides, not just the features.
At the 90th percentile, reactive work consumes $0.76 of every AI dollar, 4.2× the $0.18 platform average that reaches users as shipped product. These organizations are not outliers; they represent the ceiling of what happens when AI volume grows without a quality feedback loop.
At the median, 25% of code written in any given week is overwritten or deleted before that week closes. This is not planned refactoring or technical debt cleanup — it is code that did not survive the sprint it was written in. For teams heavily using AI coding assistants, this reflects a structural gap: the AI generates code from local context (the file, the prompt, the immediate task) but not from production reality — which patterns have failed, which edge cases have already been tried and reverted, what the actual requirement turned out to be. At the 90th percentile, more than half of all code written each week is discarded.
The industry benchmark is 27% (Pluralsight/GitPrime). The platform median is in line with it; the 90th percentile is about 2× the benchmark.
Between February 16 and May 4, weekly PR volume on the platform grew from 2,525 to 6,654, a 2.6× increase. Over the same period, reverted pull requests grew from 10 per week to a peak of 37, a 3.7× increase. Reverts are growing faster than output. Each revert triggers a bug-fix PR, and each bug-fix PR adds to the reactive work total. On this trajectory the 44% becomes 50%, then 56%. This is the compounding structure of the token maxxing trap.
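The growth multiples above follow directly from the weekly counts. This is a small sketch of the arithmetic; the helper name `growth_multiple` is hypothetical. The revert-rate comparison assumes the peak of 37 reverts fell in a week with roughly 6,654 PRs, which the figures above do not state explicitly.

```python
def growth_multiple(start, end):
    """Return how many times larger `end` is than `start`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return end / start


# Weekly counts quoted in the text (February 16 vs. May 4).
pr_growth = growth_multiple(2_525, 6_654)
revert_growth = growth_multiple(10, 37)

# ASSUMPTION: the 37-revert peak coincided with the 6,654-PR week.
start_revert_rate = 10 / 2_525
peak_revert_rate = 37 / 6_654

print(f"PR volume growth:  {pr_growth:.1f}x")
print(f"Revert growth:     {revert_growth:.1f}x")
print(f"Revert rate start: {start_revert_rate:.2%}")
print(f"Revert rate peak:  {peak_revert_rate:.2%}")
```

Under that assumption, reverts per PR rise over the period, which is what it means for reverts to grow faster than volume.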
Code review has not scaled with AI output volume. 48.5% of all PRs are approved in under 60 minutes, which leaves little time for a meaningful review. Across the PRs reviewed by the Entelligence platform, comment-level data surfaces the structural breakdown: 80% of review comments are bot-generated, and only 21.6% of all comments are ever acted on. Bug and error comments, the highest-value category at 32% of all comments, are addressed only 26% of the time. The constraint is not effort or tooling. It is that review happens without production context.
| Metric | Value | Note |
|---|---|---|
| Avg comments per PR | 20.8 | total |
| — Bot comments | 16.7 | 80.2% of total |
| — Human comments | 4.1 | 19.8% of total |
| Comments addressed | 21.6% | platform avg |
| — Bot comments addressed | 23.3% | |
| — Human comments addressed | 15.0% | |
| Avg addressed rate · per reviewer | 16% | 781 reviewers · range 0–100% |
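As a consistency check on the table, the platform-wide addressed rate should be close to the volume-weighted average of the bot and human rates. The sketch below assumes the per-PR averages can be used as weights; the function name `blended_addressed_rate` is hypothetical.

```python
# Values copied from the table above.
BOT_COMMENTS_PER_PR = 16.7
HUMAN_COMMENTS_PER_PR = 4.1
BOT_ADDRESSED_RATE = 0.233
HUMAN_ADDRESSED_RATE = 0.150


def blended_addressed_rate(groups):
    """Volume-weighted addressed rate.

    `groups` is an iterable of (comments_per_pr, addressed_rate) pairs.
    """
    groups = list(groups)
    total_comments = sum(count for count, _ in groups)
    addressed = sum(count * rate for count, rate in groups)
    return addressed / total_comments


groups = [
    (BOT_COMMENTS_PER_PR, BOT_ADDRESSED_RATE),
    (HUMAN_COMMENTS_PER_PR, HUMAN_ADDRESSED_RATE),
]
blended = blended_addressed_rate(groups)
addressed_per_pr = sum(count * rate for count, rate in groups)

print(f"Blended addressed rate:    {blended:.1%}")
print(f"Addressed comments per PR: {addressed_per_pr:.1f} of 20.8")
```

The blended rate lands within rounding of the reported 21.6%, and it implies that only about 4.5 of the 20.8 comments on an average PR lead to a change.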
Across organizations with production error tracking connected, the issue severity distribution is not what most engineering leaders expect. 132 of every 1,000 tracked issues are Critical — service-breaking, data-corrupting, or security-exposing failures. A further 627 per 1,000 are High severity. Together, 3 in 4 production issues are serious enough to cause direct user impact. Critical issues fire 3.3 times on average before anyone catches them. These are not edge cases surfaced by careful monitoring — they are failures that have already reached users multiple times before being identified and logged.
| Level | What it means | Per 1,000 issues | Avg fires | Ratio to low |
|---|---|---|---|---|
| CRITICAL | Service-breaking failures, data corruption, security exposures — direct, immediate user impact | 132 | 3.3× | 2.6× |
| HIGH | Significant functional failures, performance degradation, or data inconsistency under normal operation | 627 | 1.3× | 6.5× |
| MEDIUM | Non-blocking bugs, degraded experiences, or edge-case failures with limited blast radius | 189 | 0.6× | 3.7× |
| LOW | Minor issues, cosmetic bugs, or non-impacting edge cases with no direct user harm | 51 | 0.2× | 1× |
Based on 1,543 match events across 13 organizations with production error tracking connected. “Merged anyway” means flagged by Entelligence, then approved and merged without being fixed.
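The "3 in 4" figure is the combined Critical and High share from the severity table. A minimal sketch of that arithmetic follows; the dictionary name `ISSUES_PER_1000` is hypothetical. The four per-1,000 counts sum to 999 rather than 1,000, presumably because of rounding.

```python
# Per-1,000 issue counts copied from the severity table above.
ISSUES_PER_1000 = {"CRITICAL": 132, "HIGH": 627, "MEDIUM": 189, "LOW": 51}

# Critical and High together are the issues with direct user impact.
serious = ISSUES_PER_1000["CRITICAL"] + ISSUES_PER_1000["HIGH"]
serious_share = serious / 1000

print(f"Critical + High: {serious} per 1,000 ({serious_share:.1%})")
print(f"Sum of all levels: {sum(ISSUES_PER_1000.values())} per 1,000")
```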
Engineering teams burn 44% of AI spend on reactive work: bug fixes and maintenance. Code reviewers don't learn from production. SRE agents remediate but can't prevent. Neither delivers what engineering leaders actually need: reliability that compounds. Entelligence closes the full loop with a production intelligence world model, unifying code, incidents, observability, and customer signal into one living context graph so every fix compounds. The $0.44 shrinks. The $0.18 grows.