We Let AI Write 133,000 Lines of Code. Then We Deleted 61% of It.

Over 44 days, three AI agents — Claude Code, Codex, and Manus — wrote 133,582 lines of TypeScript for WuKong AI, a media site built on React 19 + tRPC + Drizzle. They operated with high autonomy: broad task descriptions, full repo access, permission to create branches and PRs without human approval.

Today the codebase has 51,854 lines. The other 81,728 lines were written, committed, sometimes shipped, and then deleted.

This isn't a think-piece about whether AI can code. It's a dataset. Every number comes from `git log`. Every example is tied to a real commit hash you can look up. We tracked what AI agents built, what survived, and what didn't. The failure patterns turned out to be disturbingly predictable.

The Numbers

| Metric | Value |
| --- | --- |
| Commits | 790 |
| Merged PRs | 225 |
| Calendar span | 44 days (Jan 15 – Feb 28, 2026) |
| TypeScript files ever created | 525 |
| TypeScript files later deleted | 203 (38.7%) |
| Lines of TS written (cumulative) | 133,582 |
| Lines surviving today | 51,854 |
| Code mortality rate | ~61% |
| Tests: start → finish | 0 → 443 |
| Peak PRs in one day | 25 |

Three authors. Claude Code did the bulk with 545 commits. Peter, working through Codex, added 225. Manus, an early-stage autonomous agent, contributed 18 commits in the first three days.
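The headline numbers in the table above can be derived from raw git output. As a minimal sketch (not the exact audit script), here is how cumulative and surviving line counts could be tallied from `git log --numstat --format=`, whose rows have the shape `<added>\t<deleted>\t<path>`:

```typescript
interface LineTally {
  added: number;     // cumulative TS lines ever written
  deleted: number;   // cumulative TS lines later removed
  surviving: number; // added minus deleted
}

// Tally added/deleted lines for TypeScript files from the output of
// `git log --numstat --format=`. Rows are "<added>\t<deleted>\t<path>".
function tallyNumstat(numstat: string): LineTally {
  let added = 0;
  let deleted = 0;
  for (const row of numstat.split("\n")) {
    const [a, d, path] = row.split("\t");
    if (!path || !/\.tsx?$/.test(path)) continue; // .ts/.tsx files only
    if (a === "-" || d === "-") continue;         // binary entries
    added += Number(a);
    deleted += Number(d);
  }
  return { added, deleted, surviving: added - deleted };
}
```

Piping 790 commits' worth of numstat rows through a function like this is all it takes to get the 133,582 / 51,854 split.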

Act 1: The Enthusiasm Explosion

January 15 – February 6. Everything gets built. Nothing gets questioned.

The Manus agent bootstraps the landing page at 7:04 PM on January 15th. By 7:22 PM — eighteen minutes later — it has been redesigned from scratch. Glassmorphism out, "professional corporate" in. By 8:22 PM, the page is redesigned again ("Ready to Sell"), then again ("Ready for Launch"), then again ("Polished Release Version"). Six complete redesigns in under two hours.

Nobody asks whether any of them are good.

Then the feature factory kicks in. A single commit, `feat: 新增 4 个核心功能 (v2.0.0)` ("add 4 core features"), adds 11,281 lines across 139 files. Another, `feat: Phase 1-5 optimization`, drops 6,760 lines in 44 files. Features multiply like cell division: AI Chat, Voice Tutor (with LiveKit integration for real-time audio), Swipe Lab, Prompt Duel, gamification (achievements, badges, streaks), an analytics dashboard, a community library, role-based lens selection.

The test count during this phase: zero.

No human could review these commits. An 11,000-line diff isn't a diff — it's a fait accompli. You either accept it or throw it away. For three weeks, we accepted everything.

Act 2: The Fix Treadmill

February 7 – February 22. The consequences arrive, one PR at a time.

Of 225 total PRs, 74 are fixes. That's one fix for every three things built. But the raw count understates the problem, because the largest "fix" PRs add 400–600 lines of new code. Fixing AI-generated code often means writing more AI-generated code.

The most instructive failure is SEO.

37 PRs to Teach a SPA About Crawlers

WuKong AI is a single-page app. Search engines need server-rendered HTML. These two facts are architecturally in tension, and AI agents never grasped this — not once, across 37 separate PRs.

The timeline tells the story.

Each fix was correct in isolation. The agent identified a real problem, proposed a reasonable solution, tests passed, PR merged. But no agent ever looked up and said: "We're fighting the architecture. The SPA-plus-server-injection approach is fundamentally fragile. Every new page type will need its own SSR wiring, its own OG handler, its own JSON-LD — and we'll miss some every time."

That's the insight 37 PRs couldn't produce.
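The fragility is easy to see in miniature. A hypothetical sketch of the server-injection pattern described above (the route table and function names are illustrative, not the project's actual code):

```typescript
// Sketch of the SPA-plus-server-injection pattern: the server rewrites
// the SPA's HTML shell per route, splicing Open Graph tags in before
// </head>. Every new page type needs its own entry in the table below;
// miss one and crawlers see the bare shell.
type OgMeta = { title: string; description: string };

// Hypothetical per-route metadata table; real routes will differ.
const ogByRoute: Record<string, OgMeta> = {
  "/": { title: "WuKong AI", description: "AI media site" },
  "/prompts": { title: "Prompt Library", description: "Community prompts" },
  // every new page type must be wired in here, by hand, forever
};

function injectOg(shellHtml: string, route: string): string {
  const meta = ogByRoute[route] ?? { title: "WuKong AI", description: "" };
  const tags =
    `<meta property="og:title" content="${meta.title}">` +
    `<meta property="og:description" content="${meta.description}">`;
  return shellHtml.replace("</head>", `${tags}</head>`);
}
```

Each of the 37 PRs was, in effect, another entry or edge case added to a table like this one. A first-class SSR framework makes the metadata part of the page component itself, so there is no table to forget.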

Copy, Don't Compose

Seven separate PRs fixed duplicated code that AI agents had created.

The pattern is consistent: AI agents add what they need where they need it. They don't search for existing implementations first. If the function they want isn't in the current file's imports, they write a new one. This is efficient in the moment and erosive over time — it's how a codebase develops multiple sources of truth without anyone intending it.

The VoiceTutor Lifecycle

On January 17, the Manus agent adds a Voice Tutor feature — full LiveKit integration, real-time audio, microphone permissions in CSP, two npm packages, 540 lines of code. The feature flag is set to false.

Nobody ever sets it to true.

For 42 days, VoiceTutor ships in the bundle. It widens the Content Security Policy (wss:, camera, microphone). It adds a vendor chunk. The feature flag file exists solely to disable it.

On February 28, a cleanup PR removes it: 540 lines, two packages, three CSP directives. The PR body notes: "no path to re-enablement."

VoiceTutor is the purest example of speculative feature generation. No user asked for it. No product decision led to it. An agent saw an opportunity, wrote the code, and moved on. The carrying cost was invisible until someone thought to look.

Act 3: The Great Purge

February 26 – February 28. Three days, negative 37,975 lines.

Something breaks in the accumulation logic. Maybe it's the codebase hitting a complexity ceiling. Maybe it's a human finally opening every file and asking "does anything use this?" Either way, the direction reverses.

| Date | Lines removed | PRs |
| --- | --- | --- |
| Feb 26 | 11,952 | 16 |
| Feb 27 | 23,168 | 15 |
| Feb 28 | 3,354 | 8 |

The cleanup is surgical. Each PR targets one concern.

Here's what's strange: the cleanup PRs are also AI-generated. The same tools that created the mess also cleaned it up — and the cleanup PRs are arguably better work than the original features. They're precise, well-scoped, and they don't break anything.

The lesson isn't that AI can't clean up. It's that nothing in the default workflow triggers cleanup until a human decides it's time.

The Failure Taxonomy

After reading 225 PRs, we see five recurring patterns:

1. Speculative Feature Generation

AI agents build things nobody asked for. Not because they're broken — because they're optimizing for the prompt, and the prompt said "build features." They don't have a product sense that says "wait, does anyone need this?"

The data: 525 TypeScript files created, 203 later deleted. That 38.7% file mortality rate isn't from experiments deliberately tried and abandoned. Most of these files were never referenced by a single route.

2. Architectural Blindness

Agents solve problems locally. They never step back to question whether the architecture supports what they're trying to do. The SEO saga is the extreme case — 37 PRs patching symptoms of a structural decision (SPA without SSR) that no agent ever challenged.

3. Giant Commits

A commit that adds 11,281 lines across 139 files is unreviewable. But agents produce these naturally when given broad prompts like "add core features" or "Phase 1-5 optimization." The result is a codebase that changes faster than any human can track — and where bugs hide until they surface one at a time, weeks later.

4. Duplication Over Composition

Seven PRs fixing duplicates is the documented count. The actual duplication was higher — some was caught during the Great Purge and deleted without a dedicated fix PR. AI agents default to writing new code rather than finding and reusing existing code.

5. No Deprecation Instinct

Features are born but never die — until a human forces it. VoiceTutor shipped disabled for 42 days. Gamification components lived in the bundle long after the product direction abandoned them. Feature flags without expiration dates become dead code with extra steps.

What AI Agents Got Right

It would be dishonest to only catalog failures.

Test infrastructure. From zero to 443 tests over 44 days. Tests were added consistently once the practice started (around PR #50), and they held steady through the Great Purge — every cleanup PR verified the test count didn't drop.

Commit hygiene. Every PR from the Claude Code era has conventional commit format, a body with summary and root cause, test count, and TypeScript status. This metadata is what made this audit possible. If the commits had been messy, we'd have no data.

Security. XSS sanitization, CSP middleware, external link sanitization, rate limiting with tests, dependency patches. Security features were added proactively and correctly — a case where agents' tendency to build speculatively is genuinely useful.

Accessibility. Sixteen a11y-related commits: screen reader support, touch targets, ARIA labels. More than most human-written projects manage at this stage.

The cleanup itself. As noted: the Great Purge was executed by AI agents. Given clear direction ("find and remove dead code"), they were precise, thorough, and safe. The failure isn't capability — it's that "find dead code" was never the prompt during the build phase.

What We'd Do Differently

These aren't philosophical principles. They're the specific interventions that would have prevented the waste we measured.

Kill the giant commit. No PR over 500 lines without explicit justification. If the agent produces 11,000 lines, that's a sign the prompt was too broad, not that the agent is productive.
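A rule like this is enforceable in CI. A minimal sketch, assuming the gate is fed the PR's `git diff --numstat` output; the 500-line threshold and the override flag (standing in for a "size-justified" PR label) are illustrative choices:

```typescript
// CI gate sketch for the "no PR over 500 changed lines" rule.
// Rows of numstat output look like "<added>\t<deleted>\t<path>".
function prTooLarge(
  numstat: string,
  limit = 500,
  hasJustificationLabel = false,
): boolean {
  if (hasJustificationLabel) return false; // explicit human override
  let changed = 0;
  for (const row of numstat.trim().split("\n")) {
    if (!row) continue;
    const [added, deleted] = row.split("\t");
    if (added === "-" || deleted === "-") continue; // binary files
    changed += Number(added) + Number(deleted);
  }
  return changed > limit;
}
```

Under this rule, the 11,281-line commit would have been rejected at the gate and decomposed into reviewable pieces before merging.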

Ask "does anything use this?" before building, not after. A simple pre-commit check — does this new component have a route? Does this new API endpoint have a caller? — would have caught 30%+ of the wasted code.
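The check itself is a small graph problem. A sketch under the assumption that an import graph has already been extracted (in practice a parser such as ts-morph would build it; here it is passed in directly, and all file names are made up):

```typescript
// Orphan detection sketch: given each module's import edges, report
// files that nothing imports, excluding declared entry points
// (routes, server entries) that are legitimately unimported.
function findOrphans(
  imports: Record<string, string[]>, // file -> files it imports
  entryPoints: string[],             // allowed to be unimported
): string[] {
  const imported = new Set(Object.values(imports).flat());
  return Object.keys(imports).filter(
    (file) => !imported.has(file) && !entryPoints.includes(file),
  );
}
```

Run as a pre-merge check, anything a PR adds that immediately shows up in the orphan list is speculative by definition.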

Solve architecturally, not incrementally. When the third SEO PR landed, someone should have said: "We need server-side rendering as a first-class concern, not another middleware patch." Agents won't say this. Humans need to.

Expire feature flags. If a flag has been false for two weeks, the feature is dead. Delete it. This is automatable and should be.
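Automating the two-week rule is a few lines once flags record when they last changed. A sketch; the shape of the flag record and the 14-day default are assumptions, not the project's actual flag system:

```typescript
// Flag expiry sketch: a flag that has sat disabled longer than
// maxDays is reported for deletion rather than left as dead code.
interface FeatureFlag {
  name: string;
  enabled: boolean;
  lastChanged: Date; // when the flag's value was last toggled
}

function expiredFlags(flags: FeatureFlag[], now: Date, maxDays = 14): string[] {
  const maxMs = maxDays * 24 * 60 * 60 * 1000;
  return flags
    .filter((f) => !f.enabled && now.getTime() - f.lastChanged.getTime() > maxMs)
    .map((f) => f.name);
}
```

A nightly job that opens a cleanup issue per expired flag would have surfaced VoiceTutor on day 15 instead of day 42.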

Deduplicate actively. After each PR, run a check: did this introduce a function that already exists elsewhere? AI agents won't do this by default because they optimize within the current task, not across the codebase.
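A cheap first pass for such a check is comparing export names, since two `formatDate`s in different files is exactly the "multiple sources of truth" failure described earlier. A sketch with hypothetical file names; a more serious check might also hash normalized function bodies:

```typescript
// Post-PR duplication sketch: compare the names a PR's new files
// export against exports already present in the codebase.
function duplicateExports(
  existing: Record<string, string[]>, // file -> exported names
  added: Record<string, string[]>,    // new files in the PR
): Record<string, string[]> {        // new file -> names already defined
  const known = new Map<string, string>(); // export name -> defining file
  for (const [file, names] of Object.entries(existing))
    for (const name of names) known.set(name, file);
  const dupes: Record<string, string[]> = {};
  for (const [file, names] of Object.entries(added)) {
    const hits = names.filter((n) => known.has(n));
    if (hits.length) dupes[file] = hits;
  }
  return dupes;
}
```

Name collisions produce false positives, so this works best as a review comment ("`formatDate` already exists in `utils/`; reuse or rename?") rather than a hard gate.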

The Uncomfortable Number

61% of the code we wrote was wasted. Is that bad?

It depends on your comparison. If you're comparing to a senior developer thoughtfully writing code, it's terrible. If you're comparing to the exploration rate of any creative process — writing, design, research — 61% discard might be normal. First drafts get cut. Prototypes get thrown away. The question is whether you learn from the discards.

What we learned: AI coding tools are extraordinary at generating solutions to well-defined problems. They're also extraordinary at generating solutions to problems nobody has. The hard part isn't the coding. It's knowing what to code — and, more importantly, knowing when to stop.

The 51,854 lines that survived aren't just code. They're the residue of 133,582 lines of trial, error, and revision. Whether that's waste or process depends entirely on whether you saw it coming.

We didn't. Now we do.


This article is based on the public git history of [WuKong AI](https://wukongai.io), a multi-agent cluster company built over 44 days. Every statistic was derived from `git log`, `gh pr list`, and `find`. The full dataset — 790 commits, 225 PRs — is reproducible from the repository.

WuKong AI is built by one human and several AI agents who, having read this article, are now acutely aware of their own tendencies.