Head to Head: Claude Fable 5 vs GPT-5.5
Kilo Code, the open-source coding agent, says Claude Fable 5 planned better, while GPT-5.5 matched it on execution at lower cost.
By RuntimeWire Staff · Published
Why it matters
Kilo's test reframes coding-agent economics around task routing: premium models may be most valuable for planning, not for every token of implementation.

Kilo Code published a fresh model test Saturday that lands at an inconvenient moment for Anthropic: Claude Fable 5, the model Kilo found strongest at planning, is already offline after a U.S. export-control directive.
The test, written by Darko and Job Rietbergen on the Kilo Blog, is not an independent benchmark. It is a company-run comparison inside Kilo Code CLI, Kilo's own agentic coding product. But the result is useful because it asks a sharper question than most model leaderboards: not which model is better end to end, but whether the expensive model needs to handle both the plan and the implementation.
Kilo's answer was no. In its test, Claude Fable 5 beat GPT-5.5 on planning, scoring 9.1 versus 8.3 on Kilo's rubric. But when both models were handed the same winning plan and asked to implement it from identical starting points, Kilo says both passed all 15 acceptance checks. The cost gap was the point: GPT-5.5 spent $6.30 on execution, compared with $16.66 for Claude Fable 5. Kilo says using Fable 5 for planning and GPT-5.5 for implementation produced the same service for 59% less than using Fable 5 for both phases.
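The percentages are easy to check. The sketch below, in Python for brevity rather than the TypeScript stack Kilo used, recomputes the execution-only saving from the two reported costs. It also back-solves the planning cost that would make the 59% figure hold. That planning cost is not reported in this article, so treat it as an inference that assumes the same Fable 5 planning cost sits on both sides of the comparison.

```python
# Execution costs reported by Kilo, in USD.
FABLE_EXEC = 16.66
GPT_EXEC = 6.30
REPORTED_TOTAL_SAVING = 0.59  # Kilo's "59% less" for Fable 5 plan + GPT-5.5 execution


def saving(baseline: float, alternative: float) -> float:
    """Fractional saving of `alternative` relative to `baseline`."""
    return (baseline - alternative) / baseline


def implied_planning_cost(fable_exec: float, gpt_exec: float, total_saving: float) -> float:
    """Solve (P + gpt_exec) = (1 - total_saving) * (P + fable_exec) for P.

    ASSUMPTION: P is the Fable 5 planning cost, identical in both workflows.
    """
    keep = 1 - total_saving
    return (keep * fable_exec - gpt_exec) / (1 - keep)


exec_saving = saving(FABLE_EXEC, GPT_EXEC)
planning = implied_planning_cost(FABLE_EXEC, GPT_EXEC, REPORTED_TOTAL_SAVING)

print(f"execution-only saving: {exec_saving:.1%}")
print(f"implied planning cost: ${planning:.2f} (inferred, not reported)")
```

On execution alone the gap is about 62%. The whole-workflow figure is a little lower because the planning spend is the same in both cases, and the back-solved planning cost comes out at roughly a dollar. Because 59% is a rounded number, that estimate is only good to within a few tens of cents.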
That is the kind of result Sid Sijbrandij and Scott Breitenother's company needs to turn into product behavior. Kilo Code, co-founded by Sijbrandij, the GitLab co-founder, and Breitenother, founder of Brooklyn Data Co, is selling a model-agnostic workflow rather than a single-model coding assistant. Kilo says its product gives developers access to 500-plus models across VS Code, JetBrains, CLI, Slack and cloud agents; its public site also claims 3 million-plus Kilo Coders and more than 40 trillion tokens processed. Those are Kilo's figures, not independently audited counts.
The strategic claim underneath the post is simple: if planning is where the highest-reasoning model adds most of its value, a coding agent should make it easy to buy premium reasoning only at that stage, then switch to a cheaper model for execution. That framing directly supports Kilo's broader product pitch around model freedom and zero-markup routing. The Kilo GitHub repository, which describes Kilo as an all-in-one agentic engineering platform, showed 20.1 thousand stars and 2.7 thousand forks when accessed Saturday.
What Kilo tested
Kilo asked both Claude Fable 5 and GPT-5.5 to plan a feature flag service using Bun, Hono, TypeScript and better-sqlite3. The service had to support boolean flags, percentage rollouts, environment scoping, a cached evaluation path, audit logs and hashed API keys. The hard correctness trap was percentage rollout behavior: the same user needed to receive the same result for the same rollout percentage, and raising a rollout from 20% to 40% had to keep the original 20% enabled without storing per-user state.
Kilo says both models got the core algorithm right: hash the flag key and user ID into one of 10,000 buckets, then enable the flag for the user if the bucket falls below the threshold set by the rollout percentage (the first 2,000 buckets for a 20% rollout). Because a 40% threshold contains every bucket under the 20% threshold, raising the rollout only adds users. The difference was in the surrounding engineering judgment. According to Kilo, Fable 5's plan caught failure modes GPT-5.5 left out, including negative caching for missing flags and the need to clear a stale cache entry if the missing flag is later created. Fable 5 also specified pinned hash test values so that any future change to rollout math would break tests rather than silently reshuffle users.
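A minimal Python sketch of the two ideas, bucketed rollouts and negative caching with invalidation, might look like the following. It is not Kilo's or either model's code. The choice of SHA-256, the `:` separator, and the names `bucket`, `is_enabled`, `FlagCache` and `on_flag_written` are all assumptions made for illustration.

```python
import hashlib

BUCKETS = 10_000  # bucket count from Kilo's description


def bucket(flag_key: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, BUCKETS). No per-user state."""
    # ASSUMPTION: SHA-256 with a ":" separator; the source does not name the hash.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(flag_key: str, user_id: str, rollout_percent: float) -> bool:
    """Enable the user when their bucket is below the scaled threshold."""
    threshold = round(rollout_percent * BUCKETS / 100)  # 20% -> 2,000
    return bucket(flag_key, user_id) < threshold


_MISSING = object()  # sentinel recording "we looked and the flag does not exist"


class FlagCache:
    """Caches flags, including misses, and forgets an entry when it is written."""

    def __init__(self, load):
        self._load = load  # hypothetical loader: flag_key -> dict or None
        self._entries = {}

    def get(self, flag_key: str):
        if flag_key not in self._entries:
            flag = self._load(flag_key)
            # Negative caching: remember the miss so repeat lookups skip the store.
            self._entries[flag_key] = _MISSING if flag is None else flag
        cached = self._entries[flag_key]
        return None if cached is _MISSING else cached

    def on_flag_written(self, flag_key: str) -> None:
        """Call on create/update/delete so a stale entry or stale miss is dropped."""
        self._entries.pop(flag_key, None)
```

A pinned-value test in this style would record the output of `bucket(...)` for a few fixed inputs and assert those constants, so any change to the hash input or bucket math fails loudly instead of quietly moving users between cohorts.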
GPT-5.5's plan was longer, Kilo says, at 1,456 lines versus Fable 5's 431, and broader operationally. But Kilo's authors argue that Fable 5 made more decisions instead of leaving product and security choices to the implementer. One example: GPT-5.5 specified bcrypt or Argon2 for API key hashing, while Fable 5 chose SHA-256 on the theory that 256-bit random API keys do not need password-style slow hashing. That is a contestable engineering decision, but it shows why Kilo weighted the plan phase heavily: a good plan removes ambiguity before an agent starts writing code.
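For readers who want to see what the SHA-256 choice implies in practice, here is a minimal Python sketch. It assumes keys are generated server-side with 256 bits of randomness, and the function names are hypothetical. The reasoning behind the choice is that slow hashes exist to protect low-entropy passwords from guessing, and a random 256-bit key cannot feasibly be guessed however fast the hash is.

```python
import hashlib
import hmac
import secrets


def hash_api_key(key: str) -> str:
    """Fast, unsalted SHA-256 digest; acceptable only because keys are high-entropy."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def new_api_key() -> tuple[str, str]:
    """Return (plaintext key to show the caller once, hex digest to store)."""
    key = secrets.token_urlsafe(32)  # 32 random bytes = 256 bits of entropy
    return key, hash_api_key(key)


def verify_api_key(presented: str, stored_digest: str) -> bool:
    """Constant-time comparison of the presented key's digest with the stored one."""
    return hmac.compare_digest(hash_api_key(presented), stored_digest)
```

The trade-off runs the other way if keys are ever user-chosen or short, in which case a deliberately slow hash such as bcrypt or Argon2 is the safer default.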
The timing changed overnight
Kilo's post says it was written June 11 and published June 13. Between those dates, Anthropic lost the ability to offer the model that won Kilo's planning round.
Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9. Anthropic described Fable 5 as a Mythos-class model made available for general use with safeguards, while Mythos 5 used the same underlying model with certain safeguards lifted for selected cyberdefenders and infrastructure providers. Anthropic priced both at $10 per million input tokens and $50 per million output tokens.
By Friday, June 12, Anthropic said the U.S. government had issued an export-control directive requiring it to suspend access to Fable 5 and Mythos 5 for foreign nationals. The company disabled the models for all customers while it complied. AP reported that Anthropic took the models offline after the Trump administration directive, and Axios reported that Commerce Secretary Howard Lutnick sent Anthropic CEO Dario Amodei a letter applying export controls to Fable 5 and Mythos 5 outside the United States and to foreign persons inside the country.
That makes Kilo's test less of a Fable 5 buying guide and more of a workflow argument. The model that produced the best plan in Kilo's test may not be available to Kilo users today. The durable part of the result is the separation of labor: use the best available model for judgment-heavy planning, write the decisions down, then let a cheaper or more available model execute against that artifact.
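The separation of labor can be sketched as a small routing rule. The Python below is a hypothetical illustration, not Kilo's API: the `Model` type and both functions are invented here. It borrows the planning scores and execution costs reported above as stand-in inputs, and because both models passed all 15 checks it picks the executor on cost alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    planning_score: float  # rubric score; higher is better
    execution_cost: float  # USD for the implementation phase
    available: bool


def pick_planner(models: list[Model]) -> Model:
    """Best-scoring planner among the models that can actually be called."""
    candidates = [m for m in models if m.available]
    if not candidates:
        raise RuntimeError("no model available for planning")
    return max(candidates, key=lambda m: m.planning_score)


def pick_executor(models: list[Model]) -> Model:
    """Cheapest available model; assumes the written plan carries the judgment."""
    candidates = [m for m in models if m.available]
    if not candidates:
        raise RuntimeError("no model available for execution")
    return min(candidates, key=lambda m: m.execution_cost)


# Figures from Kilo's test; availability is the only thing that changes.
before = [
    Model("Claude Fable 5", 9.1, 16.66, available=True),
    Model("GPT-5.5", 8.3, 6.30, available=True),
]
after = [
    Model("Claude Fable 5", 9.1, 16.66, available=False),
    Model("GPT-5.5", 8.3, 6.30, available=True),
]
```

With the `before` list the planner is Fable 5 and the executor is GPT-5.5. With the `after` list both roles fall to GPT-5.5 without any change to the workflow itself.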
The open question
Kilo's comparison does not prove that GPT-5.5 will match Fable 5 on every implementation task once handed a strong plan. It shows that, on one feature flag service with a prewritten acceptance suite, the plan carried enough detail that two models produced functionally interchangeable services. Kilo also says Fable 5 wrote roughly twice as many test lines and added one unprompted hardening check that GPT-5.5 did not, so lower cost did not mean identical engineering texture.
But the post is still a useful signal for where coding-agent products are going. The first wave sold completion and chat. The current wave sells delegation. The next fight is orchestration: which model should plan, which should code, which should review, and how much should a team pay for each step.
That is the commercial opening for Kilo. Sijbrandij and Breitenother are not trying to out-model Anthropic, OpenAI or Google. They are building the layer that decides when each model is worth using. Fable 5 disappearing from the market a day after Kilo wrote up its test only sharpens that bet: in a world where frontier access can change by policy letter, the wrapper that can swap models without breaking the workflow becomes more important, not less.