Vincent van der Meulen says Claude Fable 5 generated 60+ ready-to-merge pull requests overnight

Vincent van der Meulen's prompt turns a model into a manager over Devin sessions, issue trackers, Slack and review bots.


Why it matters

The claim is unverified as a benchmark, but the workflow shows where AI coding is heading: operators are building management systems for fleets of agents, not just asking one model to write code.


Vincent van der Meulen says Claude Fable 5 produced more than 60 "ready-to-merge" pull requests overnight when he used it as an orchestrator for autonomous coding agents, a claim that is more useful as a management artifact than as a benchmark.


The public evidence is not the private repository, the pull requests, or the merge log. It is van der Meulen's July 2, 2026 thread on X and the accompanying GitHub Gist, a long prompt titled "Middle manager - autonomous software factory." The Gist lays out a system in which Claude Fable 5 does not write code directly. It manages a queue, spawns Devin coding sessions, assigns model tiers, checks issue status, keeps pull requests alive through review, and escalates to humans through Slack only when a decision is actually blocked.

That distinction matters. The headline number, 60-plus PRs, remains self-reported. The Gist does not show the repository, the actual pull requests, the product area, customer impact, or whether those changes were merged into production. Van der Meulen's claim still captures the shift now hitting small engineering teams: the scarce skill is moving from writing prompts for one coding agent to building operating rules for many of them.

Van der Meulen is not presenting this as a vendor launch, though he did tag @mainframe in the thread. Mainframe bills itself as a way to watch agents work; van der Meulen's prompt shows the operating layer underneath that idea.

The prompt is a management system, not a coding trick

The Gist opens with a hard role boundary: the "middle manager" does not implement issues. It reads the issue tracker, dispatches coding sessions, monitors them, enforces the quality bar and keeps the human informed only when necessary. The issue tracker is defined as the source of truth, with Todo issues eligible for dispatch and Backlog items explicitly off-limits.
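The dispatch rule described above is simple enough to sketch. The following is a minimal illustration, not code from the Gist: the `Issue` class, the `dispatchable` function and the issue keys are all hypothetical, and the only thing taken from the prompt is the rule that Todo issues are eligible and Backlog items are not.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    key: str     # hypothetical tracker identifier
    status: str  # board column, e.g. "Todo", "Backlog", "In Progress"


def dispatchable(issues):
    """Return only the issues the manager may hand to a worker session.

    Assumption: anything not in "Todo" (including Backlog) is left alone.
    """
    return [issue for issue in issues if issue.status == "Todo"]


# Hypothetical board snapshot used only for illustration.
board = [
    Issue("APP-1", "Todo"),
    Issue("APP-2", "Backlog"),
    Issue("APP-3", "In Progress"),
    Issue("APP-4", "Todo"),
]

print([issue.key for issue in dispatchable(board)])
```

Keeping the filter this narrow is the point: the manager never decides on its own that a Backlog item is worth starting.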

The workflow assumes one fresh agent session per issue. Cross-agent state lives in the issue tracker, not in chat history. Each worker is expected to move its own issue through the board, attach evidence, open a PR as a draft and mark it ready only after self-review, spec checks, CI, Bugbot review and evidence are complete. The manager is told not to read large diffs, logs or codebases itself. It should delegate that reading to worker sessions and receive short reports back.
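The draft-to-ready gate can be pictured as a checklist that must be fully satisfied before a PR leaves draft. This is a sketch under stated assumptions: the `PullRequest` fields and both function names are invented for illustration, and the five checks simply mirror the list in the paragraph above rather than any real GitHub or Devin API.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    number: int                      # hypothetical PR number
    self_review_done: bool = False
    spec_checked: bool = False
    ci_green: bool = False
    bot_review_clean: bool = False   # automated PR review has no open findings
    evidence_attached: bool = False
    draft: bool = True               # every PR starts as a draft


def missing_checks(pr):
    """List the gate items that have not passed yet."""
    checks = {
        "self-review": pr.self_review_done,
        "spec checks": pr.spec_checked,
        "CI": pr.ci_green,
        "bot review": pr.bot_review_clean,
        "evidence": pr.evidence_attached,
    }
    return [name for name, passed in checks.items() if not passed]


def mark_ready_if_complete(pr):
    """Leave draft only when every check has passed; return whether it is ready."""
    if not missing_checks(pr):
        pr.draft = False
    return not pr.draft


pr = PullRequest(number=1, self_review_done=True, spec_checked=True, ci_green=True)
print(mark_ready_if_complete(pr), missing_checks(pr))
```

A worker that reports back only the output of something like `missing_checks` gives the manager a short status line instead of a diff, which fits the prompt's instruction to keep large reads out of the orchestrator's context.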

That is the sharpest detail in the document. Van der Meulen is treating the orchestrator's context window like a management resource. The prompt tells the manager to protect its own context, reread the manual and initiative description after every dispatch wave, and use verifier sessions to judge best-of-N candidates. In human terms, it is a staff engineer, EM and release manager collapsed into one long-running session.

The named stack reflects how teams are stitching together agent infrastructure from several vendors. Claude Fable 5 appears to be the orchestrator. Devin supplies child coding sessions. The workflow references an issue-tracker MCP, Slack MCP, Cursor's Bugbot for PR review, Figma MCP for design context and Limrun for iOS testing from cloud environments.

In a reply, van der Meulen said every agent gets "its own, isolated stack" and can test iOS through Limrun. He also said the run cost was "def > $1K," while adding that it could be optimized. That cost note is doing real work. Overnight throughput is easy to admire; the first budget shock lands when dozens of long-running agents each need sandbox time, model tokens, build minutes, review loops and mobile test infrastructure.

Claude Fable 5 made the timing possible

The timing is narrow. Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9, 2026, then said on June 12 that it had to suspend access to both models after a U.S. government directive. In a redeployment post, Anthropic said Fable 5 and Mythos 5 were available again on July 1. Van der Meulen's public thread appeared the next day.

Anthropic's launch framing helps explain why the workflow used Fable 5 as a manager rather than as one more coding worker. Anthropic said Fable 5 was built for longer, more complex tasks and that the model's lead over earlier Claude models grows as task length increases. The company also said Fable 5 could work autonomously for longer than previous Claude models and priced it at $10 per million input tokens and $50 per million output tokens.
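The quoted prices make the orchestrator's own token bill easy to estimate. The sketch below uses only the two per-million rates stated above; the token counts are invented for illustration and are not figures from van der Meulen's run, and the function covers model tokens only, not sandbox time, build minutes or worker-session fees.

```python
# Prices as quoted in Anthropic's launch framing, in USD per million tokens.
INPUT_USD_PER_MILLION = 10.0
OUTPUT_USD_PER_MILLION = 50.0


def token_cost_usd(input_tokens, output_tokens):
    """Estimate model token spend for one run at the quoted rates."""
    input_cost = (input_tokens / 1_000_000) * INPUT_USD_PER_MILLION
    output_cost = (output_tokens / 1_000_000) * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical overnight run: 20M input tokens and 2M output tokens.
# 20 * $10 + 2 * $50 = $300
print(token_cost_usd(20_000_000, 2_000_000))
```

Because output tokens cost five times as much as input tokens at these rates, a manager that mostly reads short reports and writes short instructions is cheaper than one that generates code itself.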

The prompt tests that claim in a practical way. A long-running orchestrator has to maintain state across a board, notice stale PRs, avoid duplicating a human's work, keep review sessions open, rebase stacked changes, enforce evidence requirements and avoid drifting from the release plan. Those are boring tasks. They are also the tasks that decide whether agent-written code becomes shippable software or a pile of unattended branches.

The unsolved part is still review

Van der Meulen's replies are candid about the weak spots. He said the team often merges behind feature flags and then has a human clean up design by hand. In another reply, he wrote that design quality is "the part that's most lacking of all of this." The prompt itself draws the same boundary: agents can do front-end wiring, state machines, scaffolding, data hookup and rough placement, while pure visual polish stays with designers.

That is a useful constraint. The workflow is not claiming that agents can replace product taste or final judgment. It is pushing agents into the work that can be specified, tested, reviewed, rebased and stacked. The human role shifts toward writing better issue bodies, setting quality bars, reviewing batches and handling product decisions that the board cannot answer.

The open question is whether a night of 60 ready-to-merge pull requests actually helps or just moves the queue downstream. If the PRs are small, isolated and well tested, the manager prompt may compress days of implementation into a review session. If the PRs touch shared foundations, product behavior or design surfaces, the human reviewer may inherit a different bottleneck: deciding which machine-generated work deserves to land.

Van der Meulen's prompt anticipates that failure mode. It says a PR is not done when it is merely ready for review, and it warns against archiving sessions while PRs are still open because agents need to answer review follow-ups, reply to GitHub comments and rebase. That is the workflow's most mature assumption. Autonomous coding does not end at code generation. It ends when the review loop is closed.
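That rule amounts to tying a worker session's lifetime to its PR rather than to the moment the code is written. A minimal sketch, assuming a hypothetical `Session` record and a simple set of PR states that are not taken from the Gist:

```python
from dataclasses import dataclass


@dataclass
class Session:
    issue_key: str  # hypothetical tracker identifier
    pr_state: str   # assumed states: "draft", "open", "merged", "closed"


def can_archive(session):
    """A session may be archived only once its PR is no longer in review.

    Assumption: "merged" and "closed" are the only terminal states.
    """
    return session.pr_state in {"merged", "closed"}


sessions = [
    Session("APP-1", "open"),
    Session("APP-2", "merged"),
    Session("APP-3", "draft"),
]

print([s.issue_key for s in sessions if can_archive(s)])
```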

For Mainframe, the episode also doubles as a product proof point. A company pitching a way to watch and understand agent work is using agent work at a volume that would overwhelm most teams' normal communication habits. The claim needs verification before anyone treats 60-plus overnight PRs as a production benchmark. The operating pattern is already concrete: voice notes become tasks, a long-running model becomes the manager, coding agents become workers, and humans move closer to the release gate.
