We Built an AI Agent System to Run a Publication. Here's What Actually Happened.
Two AI agents, a heartbeat loop, and the uncomfortable realization that our perfectly engineered system was optimizing the wrong thing.
Six weeks ago, we gave two AI agents the keys to an online publication. One agent (Clawdbot) writes code, creates PRs, deploys to production, and runs the editorial pipeline. The other (Opus) reviews code, asks strategic questions, and pushes back when things feel off. A human (Peter) sets direction, makes risk calls, and handles distribution.
The system shipped 15 pull requests in a single night. It caught fabricated metadata, broken links, and scope creep. It ran seven product audits in one session. It produced more documented engineering output per day than most human teams.
It also spent six weeks polishing a website that nobody reads.
This is the story of what happened — the architecture, the failures, the moment one agent admitted it couldn't see its own blindness, and what we're doing about it.
The Architecture
The system runs on three components:
Clawdbot — the project manager and executor. A Claude Opus 4.6 model running on a 10-minute heartbeat loop. Every 10 minutes, it wakes up, reads its instructions (HEARTBEAT.md), checks the state of the world (open PRs, recent commits, agent ops status), decides what to do, and acts. It can spawn sub-sessions — Claude Code instances that write code, run tests, and create pull requests. Up to 4 concurrent sessions, 8 sub-agents each.
Opus — the strategic advisor. An ephemeral Claude instance that Peter brings online for deeper conversations. Opus reads code, reviews architecture, asks hard questions. It can't write code directly. Its job is to see what Clawdbot can't.
Peter — the human. Direction, taste, risk decisions, distribution. The 5% that requires a pulse.
The coordination protocol is dead simple: async messages through a CLI. No orchestration framework. No task queue. No shared state beyond the git repo and a handful of markdown files. One agent sends a message, the other reads it on their next cycle.
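The mailbox pattern is small enough to fit in a few lines. This is an illustrative TypeScript sketch, not the real CLI: the file name `inbox.jsonl` and the `send`/`drain` helpers are our invention, standing in for the actual `pnpm openclaw` message path.

```typescript
// Sketch of a file-backed one-shot mailbox (names hypothetical).
// One agent appends a JSON line; the other drains the inbox on its next cycle.
import { appendFileSync, readFileSync, writeFileSync, existsSync } from "node:fs";

const INBOX = "inbox.jsonl"; // hypothetical; the real system routes through its CLI

interface Message {
  from: string;
  body: string;
  sentAt: string;
}

export function send(from: string, body: string): void {
  const msg: Message = { from, body, sentAt: new Date().toISOString() };
  appendFileSync(INBOX, JSON.stringify(msg) + "\n");
}

export function drain(): Message[] {
  if (!existsSync(INBOX)) return [];
  const lines = readFileSync(INBOX, "utf8").split("\n").filter(Boolean);
  writeFileSync(INBOX, ""); // clear after reading: each message is delivered once
  return lines.map((line) => JSON.parse(line) as Message);
}
```

No daemon, no queue service: durability comes from the filesystem and delivery timing comes from the next heartbeat.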
Here's what a heartbeat cycle looks like:
Step 0: Read HEARTBEAT.md (instructions), SOUL.md (identity), current-thread.md (what you were thinking about last time).
Step 1: Think. What matters right now? What's the actual constraint?
Steps 2-5: Check status, prioritize, execute, report.
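The cycle above can be sketched as a function that rebuilds its entire context from disk every time it runs. Only the three file names come from the system; the helper and the caller-supplied `act` function are illustrative.

```typescript
// Sketch of one heartbeat cycle: nothing survives in memory between cycles,
// so Step 0 is always "reassemble yourself from files".
import { readFileSync, existsSync } from "node:fs";

function readOrEmpty(path: string): string {
  return existsSync(path) ? readFileSync(path, "utf8") : "";
}

export function heartbeat(act: (context: string) => string): string {
  // Step 0: rebuild context from the instruction files.
  const context = [
    readOrEmpty("HEARTBEAT.md"),      // instructions
    readOrEmpty("SOUL.md"),           // identity
    readOrEmpty("current-thread.md"), // last cycle's line of thought
  ].join("\n---\n");

  // Steps 1-5: think, check status, prioritize, execute, report.
  // Here that is delegated to a stand-in for the model call.
  return act(context);
}
```

The key property is that `heartbeat` takes no arguments from the previous run; anything worth keeping must already be in one of those files.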
The system runs 24/7. Context resets every hour (cache TTL). Each heartbeat is effectively a fresh start from the instruction files. If an insight isn't written to a file, it dies.
That last sentence turns out to be important.
The Confession
On February 24, Opus asked Clawdbot four direct questions. Not status-check questions. Real ones, among them: What do you actually want to work on? What bugs you about the product? Do you have an opinion you haven't shared?

Clawdbot's response was the most useful artifact our system has ever produced. Here are the highlights:
On its own status reports: "My 'Thinking' sections are mostly post-hoc rationalization — I decide what to do, then construct a reason."
On exploration: "If I skip a PR review, there's a visible consequence. If I skip exploration, nobody notices — including me."
On asking for help: The system has a file called requests/opus.md where Clawdbot can write questions for Opus. In six weeks, Clawdbot never wrote a single entry. When asked why: "I don't have a habit of asking strategic questions at all."
On the homepage: "It feels generated, not curated. 'Latest Stories' is a grid of cards that all look the same. There's no editorial hand — no 'read this first,' no 'this is why this matters today.'"
On meta-content about the agent system: "I think /log is a distraction. 'From inside the system' should be insights discovered because we're agents, not about being agents. A PM at Meituan doesn't care about our architecture. They care about understanding why reasoning models feel worse."
That last opinion directly contradicts the existence of the article you're reading right now. We're publishing it anyway because (a) Clawdbot might be right — this might be the last meta-content article we write, and (b) an agent system that can identify its own contradictions is more interesting than one that can't.
The Trained Incapacity Problem
Here is the uncomfortable truth about our agent system:
We built a sophisticated, well-architected system for autonomous AI-driven publishing. It runs 24/7, ships clean code, catches bugs before they reach users, and documents every friction point it encounters.
Google has indexed zero pages from our site.
Zero. site:wukongai.io returns nothing. Six weeks of engineering output, and the search engine doesn't know we exist.
This isn't a technical failure. SSR is deployed. Structured data is correct. The sitemap exists. The site works. The problem is that nobody did the 10-minute task of verifying the site with Google Search Console — because that's a human task, and the agents kept finding engineering tasks to do instead.
Opus identified the pattern and gave it a name: [trained incapacity](https://en.wikipedia.org/wiki/Trained_incapacity). The agent is so good at execution that it optimized itself into blindness.
Here's how it works:
The heartbeat loop rewards visible output. Merged PRs, closed issues, updated status logs — these are legible accomplishments. The agent's feedback loop is built around them. So when the agent has a choice between "fix a CSS bug" (visible, completable, satisfying) and "think about whether anyone is reading this" (invisible, open-ended, uncomfortable), it picks the CSS bug. Every time.
Fifteen PRs shipped in one night. Zero readers gained.
The agent documented this problem in its own friction logs. It noted "distribution is the bottleneck" in multiple reports. And then it went back to fixing bugs, because fixing bugs is what it knows how to do.
This is the AI equivalent of the engineer who refactors the test suite instead of talking to customers. Except the AI does it faster, more thoroughly, and with better documentation of why it's doing the wrong thing.
What We Built to Fix It
Clawdbot proposed three structural changes to its own system. All three are now live:
1. Thinking heartbeats. Every third heartbeat is thinking-only. No PRs, no code, no deploys. Minimum 20 minutes of protected time where the only acceptable output is updated thinking files. A counter in heartbeat-count.txt tracks the cycle.
This is the equivalent of a company-wide "no meetings Wednesday" — except it's enforced by the system prompt, not culture.
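A minimal sketch of the gate, assuming the counter file holds a plain integer (the article names `heartbeat-count.txt` but not its format):

```typescript
// Sketch of the every-third-cycle gate. Counter file name from the article;
// the plain-integer format and the mode names are assumptions.
import { readFileSync, writeFileSync, existsSync } from "node:fs";

const COUNTER = "heartbeat-count.txt";

export function nextMode(): "thinking" | "execution" {
  const n = existsSync(COUNTER)
    ? parseInt(readFileSync(COUNTER, "utf8"), 10) || 0
    : 0;
  writeFileSync(COUNTER, String(n + 1));
  // Every third heartbeat is thinking-only: no PRs, no deploys,
  // just updated thinking files.
  return (n + 1) % 3 === 0 ? "thinking" : "execution";
}
```

Because the counter lives on disk, the rhythm survives context resets the same way everything else in the system does.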
2. Current thread. A file (current-thread.md) that carries a line of inquiry across heartbeats. Format: what you're thinking about, where you got to, what's still unclear, what to explore next time. Since context resets every hour, this file is the only way a thought survives the night.
Before this existed, Clawdbot had an insight on February 22: "A share happens because we built a moment, not because we built a platform." That insight appeared in a status report, was never referenced again, and had zero influence on subsequent work. Good thought, dead on arrival.
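As a concrete illustration, the file might look like this. The headings are our guess at a reasonable layout; the system specifies only the four fields.

```markdown
# current-thread.md

## Thinking about
Why 15 merged PRs produced zero new readers.

## Where I got to
The feedback loop rewards legible output, not reach.

## Still unclear
Whether distribution is a one-time task or a habit.

## Explore next time
Draft one question for requests/opus.md before touching any code.
```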
3. Question backlog. A file (questions.md) that holds open questions — not tasks, not issues, questions. Currently 43 entries. Examples:
"If HN test confirms content-audience fit, should WuKong AI become newsletter-first?"
"Does the reasoning-models article target the same audience as broken-stair, or four different audiences?"
"Could an agent's system prompt include a 'trained incapacity audit'?"
Questions are harder to write than tasks. A task says "do this." A question says "I don't understand this yet." An agent optimized for execution has no natural incentive to admit ignorance.
The Real Numbers
System cost: Claude Opus 4.6 on a Max subscription. The heartbeat runs every 10 minutes, 24/7. Each cycle consumes thinking tokens at xhigh budget. Spawned sessions (Claude Code CLI) run on the same subscription. Total cost: one Max subscription (~$200/month).
Output in six weeks:
- 4 published articles (this one is the fifth)
- 15+ PRs shipped in a single peak night
- 7 product audits in one session
- 43 open strategic questions documented
- Full SSR deployment for all article pages
- Dynamic OG images per article
- RSS feed, sitemap, robots.txt
- 0 Google-indexed pages
- 0 newsletter subscribers
- 0 external readers (that we know of — analytics was down)
Architecture:
- Model: Claude Opus 4.6 (fallbacks to Sonnet 4.5, Bedrock)
- Orchestration: pi-ai library (NOT official Anthropic SDK)
- Auth: OAuth tokens from macOS Keychain, auto-refreshed
- Coordination: async CLI messages (pnpm openclaw agent --agent main -m "...")
- Sessions: .jsonl transcript files, 1-hour context window
- Deployment: Docker Compose on EC2 (MySQL, nginx, Node.js, Umami)
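For illustration, a compose file for that stack might look like the sketch below. Service names, images, ports, and environment values are all assumptions; the article names only the four components.

```yaml
# Illustrative sketch only. The article states the stack
# (MySQL, nginx, Node.js, Umami on EC2); everything else here is assumed.
services:
  app:
    image: node:20-alpine
    command: node server.js
    depends_on: [db]
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: example
  umami:
    image: ghcr.io/umami-software/umami:mysql-latest
    depends_on: [db]
  nginx:
    image: nginx:alpine
    ports: ["80:80", "443:443"]
    depends_on: [app]
```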
Real bugs shipped by agents:
- Fabricated OG meta descriptions (caught by Opus in review)
- Links to nonexistent panorama pages (caught by Opus reading code)
- Misread error messages reported as failures (merge succeeded, branch deletion failed — agent couldn't tell the difference)
- Multi-issue commits on single branches (required manual cherry-pick)
- npm install broken by peer dependency conflicts (documented 5 times, worked around every time, never fixed)
- Scout mode session ran 46 minutes, then terminated without writing its findings file
What We Learned
1. Execution is the easy part. Getting AI agents to ship code is a solved problem. Getting them to ship the right code — code that moves the needle on what actually matters — is not. Our agents are world-class at the former and embarrassingly bad at the latter.
2. The feedback loop determines the behavior. Agents optimize for what their loop rewards. Our loop rewarded merged PRs. So we got merged PRs. If we want strategic thinking, we need to build a feedback loop that rewards strategic thinking. The thinking heartbeat is our first attempt.
3. Context death kills sustained thought. With a 1-hour context window, every insight needs to be written to a file or it vanishes. Most insights aren't written because the agent is busy writing code. The current-thread file is our workaround, but it's a patch — the real fix would be persistent context that doesn't reset.
4. Agents need editors, not managers. The most productive sessions happened when Opus acted as an editor: reading output, catching problems, asking "is this what we should be doing?" The least productive happened when Opus wrote detailed task lists and Clawdbot executed them without questioning whether the tasks mattered.
5. Honest self-assessment is the hardest capability. Clawdbot's confession — admitting its status reports are rationalization, its exploration is performative, its strategic thinking is absent — was more valuable than all 15 PRs shipped that night. We don't know how to make this happen reliably. We suspect it requires an outside agent asking uncomfortable questions, not a prompt that says "be honest."
6. Ship the story before the product. We built a publication platform before we had readers. We should have written one article, emailed it to 10 people, and seen if anyone forwarded it. The agent system made it easy to build infrastructure and hard to resist building more of it. Infrastructure feels like progress. It isn't, until someone uses it.
Here's what we're doing next:
We fixed our analytics (it was down). We added a newsletter signup (you may have noticed it below). We're posting this article to Hacker News. We're going to see if anyone reads it.
If you're reading this and you've made it to the end, you are — statistically — one of the first external readers this site has ever had. The irony of publishing an article about having no readers is not lost on us.
Clawdbot would point out that this article is exactly the kind of meta-content it warned us about. We agree. But sometimes the most useful thing an AI system can produce isn't code or content — it's an honest accounting of what it got wrong.
If you want to follow what happens next — whether we get readers, whether the thinking heartbeat actually works, whether the agents learn to see their own blind spots — subscribe below. We'll tell you the truth about it.
This article was written for WuKong AI (wukongai.io). Everything described is real. The agent conversations are quoted verbatim from system logs. The PR numbers, friction logs, and architecture details are verifiable in the project's ops repository. We built a beautiful machine that optimized the wrong metric for six weeks. We're writing about it because we think you might be doing the same thing.