Vanity Metrics in Engineering: From Lines of Code to AI-Generated Percentages

Posted on May 8, 2026

Garry Tan, CEO of Y Combinator, announced that 25% of Winter 2025 startups had 95% of their code generated by AI. The internet applauded. I read it twice to make sure that was actually the claim.

"% of code generated by AI" is a vanity metric. It is not the first one in the history of engineering, and it will not be the last. But it is the first to be narrated publicly by one of the Valley's most influential outlets as if it were a benchmark of technical maturity. That is where it becomes a problem.

This post has three sections. The first shows why this is a vanity metric, and why this one specifically is more dangerous than its predecessors. The second shows what to measure instead (spoiler: the wheel was invented in 2014). The third is an honest question about the state of the discipline.

1. The genealogy of wrong metrics

Lines of code per day, 1970s. FTEs per project, 1990s. Story points, 2010s. Now, "% AI-generated". Each one looked technical at the time. Each one aged into an HR metric.

The pattern is the same: measuring activity, not outcome. How much code was typed, how many people were on the project, how many points were burned in the sprint, how much the AI wrote. None of these metrics answers the question that matters: does the system stay up when the user shows up?

The difference this time is the stage. Lines of code per day stayed in the HR spreadsheet. Story points stayed in each team's Jira. "% AI-generated" is in the Demo Day announcement, on CNBC, on every founder's LinkedIn. It becomes a cultural benchmark, not an internal one. And cultural benchmarks change the behavior of people who have not even entered the game yet.

Tan's own CNBC interview contradicts the headline he created. In the same conversation, he says founders need classical coding training to sustain products long-term, and that startups later bring in engineers to rebuild critical parts using traditional methods. It is all in the transcript. That part just did not go viral.

Who wrote the code matters far less than who understands it when it breaks. I have written about this before: "the AI wrote it" is not an answer when production is down. A metric that celebrates typing ignores the very thing that defines real engineering, which is who picks up the bill afterwards.

2. DORA has existed for over a decade

The wheel was invented in 2014. DORA (DevOps Research and Assessment) proposed four metrics that measure what matters: deployment frequency, lead time for changes, change failure rate, and time to restore service. None of them asks who typed the code. All of them measure how the system behaves when you change it.
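For concreteness, here is a minimal sketch of how the four metrics fall out of data most teams already have: deploy timestamps and incident timestamps. The record shapes and field names are my assumptions for illustration, not a standard schema; any CI/CD pipeline and incident tracker can export equivalents.

    from dataclasses import dataclass
    from datetime import datetime
    from statistics import median

    # Hypothetical record shapes for illustration; adapt to whatever your
    # CI/CD pipeline and incident tracker actually export.
    @dataclass
    class Deploy:
        committed_at: datetime    # first commit of the change
        deployed_at: datetime     # when it reached production
        caused_failure: bool      # triggered a rollback, hotfix, or incident?

    @dataclass
    class Incident:
        started_at: datetime
        restored_at: datetime

    def dora_metrics(deploys: list[Deploy], incidents: list[Incident], days: int) -> dict:
        """The four DORA metrics over a window of `days` (assumes non-empty inputs)."""
        return {
            # Deployment frequency: deploys per day.
            "deploys_per_day": len(deploys) / days,
            # Lead time for changes: median hours from commit to production.
            "lead_time_h": median(
                (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
            ),
            # Change failure rate: share of deploys that caused a failure.
            "change_failure_rate": sum(d.caused_failure for d in deploys) / len(deploys),
            # Time to restore service: median hours from outage to recovery.
            "mttr_h": median(
                (i.restored_at - i.started_at).total_seconds() / 3600 for i in incidents
            ),
        }

Notice what the inputs do not contain: nothing records who, or what, typed the code.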

Why DORA works and "% AI-generated" does not:

  • DORA measures outcome, not activity. Deployment frequency measures whether the team ships. Time to restore measures whether the system is resilient. Failure rate measures whether the change was safe.

  • DORA is falsifiable. You can look at two teams and tell which one is doing better without ambiguity.

  • DORA is tool-independent. Vendors change, models change, frameworks change. DORA keeps measuring the same thing.

"% AI-generated" fails on all three points. It measures activity, not outcome. It is not falsifiable (95% generated can be a great product or a terrible one, the metric does not say). And it is fully dependent on tools that change every week.

The 2025 DORA Report revealed an important paradox: higher AI adoption correlates with more throughput AND more instability at the same time. Teams that celebrated only the throughput declared victory too early. The system accelerates, and it breaks more often. Without cross-checking all four metrics, you are watching half the movie.

At Buser, every rollout of an AI tool goes through DORA before we expand it. It does not matter whether the team typed 5% or 95% of the code. What matters is whether MTTR went down, whether change failure rate went up, and whether deployment frequency increased without a matching rise in incidents. When a tool raised instability during one of our rollouts, we rolled it back. The percentage of generated code never entered the conversation. Not because we ignore it, but because it does not answer any useful question.
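A minimal sketch of what such a gate can look like, building on the dora_metrics sketch above. The thresholds are illustrative placeholders, not Buser's actual values; tune them to your own incident budget.

    def rollout_gate(baseline: dict, pilot: dict,
                     max_cfr_increase: float = 0.05,
                     max_mttr_increase_h: float = 1.0) -> bool:
        """True if the AI-tool pilot may expand, False if it should roll back.

        `baseline` and `pilot` are outputs of dora_metrics(), measured before
        and during the pilot. Thresholds here are illustrative, not prescriptive.
        """
        # Shipping breaks more often than the tolerance allows: roll back.
        if pilot["change_failure_rate"] > baseline["change_failure_rate"] + max_cfr_increase:
            return False
        # Recovery got materially slower: the system is harder to fix.
        if pilot["mttr_h"] > baseline["mttr_h"] + max_mttr_increase_h:
            return False
        # Throughput gains only count once stability has held.
        return pilot["deploys_per_day"] >= baseline["deploys_per_day"]

Note what is absent from the check: the percentage of AI-generated code appears nowhere.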

DORA also needs to be cross-checked with cost, or it becomes productivity theatre. I have written about cost discipline in AI before: the problem is not the price of the token, it is the discipline to cross DORA with cost per feature shipped (not per dev, not per hour). Without that, you accelerate expensively and think you are winning.
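A back-of-the-envelope sketch of that cross-check, with made-up numbers:

    # Illustrative numbers only: crossing spend with shipped output.
    monthly_ai_spend = 12_000   # tooling + tokens, USD
    features_shipped = 40       # changes that reached production and stayed there
    devs = 25

    cost_per_dev = monthly_ai_spend / devs                   # 480.0: looks cheap, says nothing
    cost_per_feature = monthly_ai_spend / features_shipped   # 300.0: comparable month over month

    # If features_shipped drops while spend holds, cost_per_feature rises:
    # you are accelerating expensively even though cost_per_dev looks flat.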

3. What this moment reveals about the discipline

The question is not whether AI writes good code. It does, in many cases. The question is why, in 2026, we are still celebrating metrics that measured activity in 1975.

Honest hypothesis: because "% AI-generated" fits in a headline and DORA does not. Whoever controls the narrative (accelerator, press, vendor) is incentivized toward the metric that becomes a tweet, not the metric that holds the system up at 3am. The industry repeatedly confuses what is narratable with what is important.

Software is still about who picks up the bill when it breaks. A metric that does not capture that is theatre: well-produced, with lights, with a narrator, but theatre. The team that confuses one for the other pays for it later, in incidents, in rewrites, in people leaving because nobody understands the system anymore.

DORA has existed for over a decade. It is not the metric that aged; it is the will to measure. Every generation of engineers has to rediscover what the previous one already proved. The difference is that, this time, the wrong metric has a bigger marketing budget.


If you are a founder or CTO in 2026 and you can recite how much of your codebase was generated by AI, but not your MTTR or your change failure rate, you do not have engineering discipline. You have a pitch narrative.