Mollick's Claude Fable 5 test highlights hours-long agent work, not another launch demo

Ethan Mollick says Claude Fable 5 worked for hours across research and coding tasks, offering a long-horizon outside read on the Amodeis' agent bet.

Why it matters

Mollick's account shifts the Fable 5 debate from benchmark scores to operating behavior: whether Anthropic can make long-running agents useful without losing control of them.

Dario and Daniela Amodei's Anthropic now has a detailed public account from an outside user of what Claude Fable 5 feels like in sustained work. In a June 9 One Useful Thing essay, Wharton professor Ethan Mollick says the model could execute multi-page specifications for "up to a dozen hours" and outperformed the public models he had used.

That is not the same as an audited benchmark, and Mollick is clear that his account is experiential. Rather than centering on safety fallbacks or UX polish, his write-up focuses on hours-long autonomous work in Claude Code, with Fable orchestrating research agents to build complex outputs like an isochrone map.

The founder context matters. The Amodeis, former OpenAI employees, started Anthropic in 2021 around a safety-first public benefit corporation structure. Fable 5 is the latest test of whether that posture can survive the commercial race toward longer-running AI agents. Mollick's account gives the release a different kind of evidence: a skilled outside user describing operating behavior rather than a company benchmark or a launch-page claim.

The outside test Anthropic needed

Mollick's post is useful because he did not test the model on the category getting the most attention around Mythos: cybersecurity. He writes that "the guardrails around Fable essentially prevent it from being used for cybersecurity at all." Instead, he pushed the model through creative, research, and coding tasks in Claude Code, Anthropic's terminal-based coding agent.

That distinction narrows the story. This is less a test of Anthropic's restricted security frontier than a look at whether the public Fable model can sustain useful work over long horizons: taking an underspecified request, researching missing pieces, writing code, checking its output, and continuing without constant human steering.

Mollick's examples line up with that long-horizon product pitch. He says Fable produced, from one prompt and one piece of feedback, what he called the most sophisticated academic social science paper he had seen from an AI. It also wrote a 10-page epic rhyming poem in which every word starts with "s." He published lighter tests as well: a coin-flip game, a self-aware snake game, a Rilke-inspired art game, and a descent game. He says the model generated the visuals and objects mathematically rather than by pulling in external image assets.

The more important evidence is not the games. It is the way the model handled open-ended work.

What Fable did with a messy task

Mollick's strongest example is an isochrone map, a visualization that shows how far someone can travel from a starting point in a fixed amount of time. He says prior models had failed at this because the task requires research, judgment calls, routing assumptions, and design decisions rather than a clean coding prompt.
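For readers unfamiliar with the term, the core computation behind an isochrone can be sketched in a few lines. This is an illustration only, not Mollick's prompt or Fable's code: it assumes a hypothetical toy network (`TOY_NETWORK`) with invented travel times in minutes, and uses a standard shortest-path search (Dijkstra's algorithm) that stops at a time budget. A real map would also need the schedules, transfer times, and routing assumptions described in this section.

```python
import heapq


def isochrone(graph, start, budget_minutes):
    """Return {place: minutes} for every place reachable within the budget.

    graph maps a place to a list of (neighbor, travel_minutes) pairs.
    """
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        elapsed, place = heapq.heappop(queue)
        if elapsed > best.get(place, float("inf")):
            continue  # stale queue entry; a faster route was already found
        for neighbor, minutes in graph.get(place, []):
            total = elapsed + minutes
            # Keep the route only if it fits the budget and beats what we have
            if total <= budget_minutes and total < best.get(neighbor, float("inf")):
                best[neighbor] = total
                heapq.heappush(queue, (total, neighbor))
    return best


# Hypothetical toy network; every place name and travel time is invented.
TOY_NETWORK = {
    "home": [("station", 15), ("airport", 40)],
    "station": [("city_b", 90)],
    "airport": [("city_c", 120)],
    "city_b": [("city_d", 60)],
}

reachable = isochrone(TOY_NETWORK, "home", 120)
```

With a 120-minute budget, the sketch keeps the station, the airport, and city_b, and drops city_c and city_d because the fastest route to each exceeds the limit. The search is the easy part; gathering and reconciling trustworthy travel times is the research-heavy part.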

His prompt asked Fable, through Claude Code, to build a researched map using real data across airports, airport transfer times, trains, walking, and driving. According to Mollick, the model retrieved more than 2,200 specific flights, gathered rail schedule information from systems including TGV and Shinkansen, used country-level road-speed data from academic papers, coded while research agents were still running, and then launched additional agents and tests to verify the result.

Those are self-reported outputs from one skilled user, not independent proof that Fable will reliably do the same for an enterprise team with messy internal systems. But the pattern is the point: Anthropic is pushing Claude Code beyond autocomplete and into agent orchestration. Anthropic describes Claude Code as an agent in the terminal that can understand a codebase, execute routine tasks, build features, handle Git workflows, and connect to tools including Atlassian, Intercom, and Cloudflare services.

Mollick's post also captures the operator-level tension in this generation of models. "Delightful because I just asked for something and it happened," he wrote. "And also unnerving because I just asked for something and it happened." That is the market Anthropic is now selling into: executives want leverage, engineers want control, and security teams want to know what the system is doing during those long autonomous runs.

The safety bet is now a product bet

Anthropic has spent years arguing that frontier model releases should be constrained by testing and deployment discipline. Fable makes that argument concrete. If Mollick's account is representative, the product shift is not just smarter answers. It is longer horizons: models that can run research processes, spawn subagents, test their own work, and keep pursuing a specification for hours.

That is exactly the capability that makes the tool valuable, and exactly the capability that makes safety controls harder to evaluate from a static model card or a leaderboard. Mollick's account does not prove reproducibility across ordinary teams, and it does not settle how well Fable behaves inside private data environments, compliance workflows, or cost-constrained enterprise deployments. It does show why Anthropic's next competitive test may be less about isolated answers than about whether users can supervise agents that keep working after the first prompt.

The cleanest read is that Anthropic is trying to split the difference: expose enough of the Mythos-class jump to win developers and enterprise users, while keeping the riskiest use cases constrained. Mollick gives Anthropic a favorable early user narrative. The next test is whether ordinary teams can reproduce that experience under real constraints: private data, compliance rules, cost ceilings, brittle workflows, and managers who need to know when an agent should stop.
