Claude Fable 5 tops Remote Labor Index with 16.10% automation score

The Scale AI and Center for AI Safety benchmark tests agents on freelance projects judged against accepted human work.

By Ryan Merket · Published Jul 1, 2026, 2:23pm CT

Why it matters

RLI turns the AI labor debate into a harder test: whether an agent can produce client-acceptable freelance deliverables, not just pass exams.

Chubby (@kimmonismus) surfaced a sharp new labor-automation data point on Wednesday: Anthropic's Claude Fable 5 is now listed at 16.10% on the Remote Labor Index, putting it well ahead of every other model on a benchmark built from real freelance work.

The score matters because RLI is not another exam-style test of reasoning or coding snippets. The benchmark, created by Scale AI and the Center for AI Safety (@CAIS), uses 240 self-contained projects sourced from professional freelancers, then asks whether an AI agent's deliverable would be accepted by a reasonable client when judged against the human-produced reference work. Scale's leaderboard defines the primary metric, Automation Rate, as the share of projects where the AI output meets or exceeds that human standard.

On the current public leaderboard, Fable 5 ranks first at 16.10%. Opus 4.8 is second at 8.33%, Codex GPT 5.5 is third at 6.25%, and an earlier Claude Opus 4.6 CoWork run is fourth at 4.17%. That makes Fable's score almost double that of the next listed model and more than six times the 2.5% ceiling reported when Scale introduced RLI in October 2025.

There is an important footnote. Scale says it evaluated 218 of RLI's 240 projects before access to Fable 5 was restricted by the U.S. government. The remaining 22 projects, Scale says, were spread uniformly across the benchmark rather than clustered in a single sector or difficulty band. Scale also says that even if Fable failed every missing project, its automation rate would still be 14.6%, higher than any other listed model.
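The floor Scale cites can be checked with simple arithmetic. The sketch below is illustrative rather than Scale's own calculation: it takes the reported 16.10% rate over 218 evaluated projects, counts all 22 unevaluated projects as failures, and rescales to the full 240. The function and variable names are assumptions made for this example.

```python
def worst_case_rate(reported_rate: float, evaluated: int, total: int) -> float:
    """Rescale a rate measured on `evaluated` projects to `total` projects,
    counting every unevaluated project as a failure."""
    if not 0 < evaluated <= total:
        raise ValueError("evaluated must be between 1 and total")
    # The published rate is rounded, so the implied success count may be fractional.
    implied_successes = reported_rate * evaluated
    return implied_successes / total


REPORTED_RATE = 0.1610   # Fable 5's listed Automation Rate
EVALUATED = 218          # projects scored before access was restricted
TOTAL = 240              # full RLI project count
RUNNER_UP_RATE = 0.0833  # Opus 4.8, second on the leaderboard

floor = worst_case_rate(REPORTED_RATE, EVALUATED, TOTAL)
print(f"worst-case rate: {floor:.1%}")                       # worst-case rate: 14.6%
print(f"still ahead of runner-up: {floor > RUNNER_UP_RATE}")  # still ahead of runner-up: True
```

Under those inputs, the worst case lands at about 14.6%, which matches the figure Scale gives and stays well above the 8.33% posted by the second-place model.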

That qualifier keeps the result from being a clean full-benchmark completion. It does not make the result easy to dismiss. RLI is designed around end-to-end commissioned work, not isolated skills. Each project includes a brief, input files, a professionally accepted human deliverable, and economic data. The dataset covers 23 Upwork domains and, according to Scale's leaderboard, represents $143,991 of human work. The projects average 28.9 hours of human completion time and $632.60 in value, with a median of 11.5 hours and $200.

The benchmark is measuring a narrower but more concrete question

RLI's framing is useful because it puts a harder denominator under the AI labor debate. A model that can write code, answer questions, generate images, and operate a browser can still fail when asked to produce the multi-file, client-ready deliverable that a freelancer was actually paid to make.

Scale's methodology reflects that constraint. The benchmark filters an initial pool of 550 projects down to 240 after review for completeness, reproducibility, and professional quality. It excludes work that requires physical labor, long-term evaluation, or direct client interaction. Ten projects are public for qualitative review; 230 remain private for official leaderboard scoring.

The evaluation is also manual. Scale says trained experts compare the AI deliverable with the human reference and apply a three-point score: fail, meet standard, or exceed standard. The Automation Rate counts projects scored as meeting or exceeding the human deliverable. Scale reports 94.4% inter-annotator agreement for that metric.
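As a minimal sketch of how that metric works, the snippet below computes an Automation Rate from per-project grades. The label strings and the sample grades are hypothetical, invented for illustration; they are not Scale's data or tooling.

```python
from collections import Counter

# Hypothetical labels for the three-point scale described above.
FAIL, MEETS, EXCEEDS = "fail", "meets", "exceeds"
PASSING = {MEETS, EXCEEDS}


def automation_rate(grades: list[str]) -> float:
    """Share of graded projects whose AI deliverable meets or exceeds
    the human reference."""
    if not grades:
        raise ValueError("no grades to score")
    counts = Counter(grades)
    unknown = set(counts) - {FAIL, MEETS, EXCEEDS}
    if unknown:
        raise ValueError(f"unrecognized grades: {sorted(unknown)}")
    passed = sum(counts[label] for label in PASSING)
    return passed / len(grades)


# Made-up example: 12 projects, 2 of which pass.
sample = [FAIL] * 10 + [MEETS, EXCEEDS]
print(f"{automation_rate(sample):.2%}")  # 16.67%
```

The point of the sketch is that "meets" and "exceeds" count the same toward the headline number: a deliverable that merely matches the human reference raises the rate as much as one that beats it.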

That setup explains both the force and the limit of the Fable result. A 16.10% score is not a claim that AI can automate 16% of the remote labor economy. It is a claim that, under this benchmark's project mix, environment, tools, and judging process, Fable's agent run produced acceptable work on roughly one in six evaluated projects. The benchmark excludes whole categories of work, and Scale itself notes that cost savings are zero when an agent fails a project.

The earlier RLI results show why the jump drew attention. At launch, Scale and CAIS said the best-performing agent, Manus, reached 2.5% Automation Rate, with other systems lower. The leaderboard's failure analysis said unsuccessful projects often broke down for practical reasons: low-quality outputs, incomplete deliverables, corrupt or incorrect files, and inconsistency across artifacts. Those are the failure modes that separate a polished demo from a finished piece of paid work.

Anthropic's release timing makes the score more than a benchmark story

The Fable result landed as Anthropic was bringing the model back online after a two-and-a-half-week access fight with the U.S. government. Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9, describing Fable as a Mythos-class model with safeguards for general use. On June 12, Anthropic said a U.S. export-control directive required it to suspend all access to Fable 5 and Mythos 5 for foreign nationals, and that the company disabled both models for all customers because it could not verify nationality in real time.

Anthropic said on June 30 that the export controls had been lifted and that Fable 5 would be available globally starting July 1 across Claude Platform, Claude.ai, Claude Code, and Claude Cowork. The company also said the June 12 order followed a report in which Amazon researchers found a way to bypass Fable 5's safeguards and prompt the model to identify software vulnerabilities. Anthropic said it trained a new safety classifier that blocks the reported technique in more than 99% of cases, while acknowledging that the classifier would increase false positives during ordinary coding and debugging.

That context matters for the RLI score because Fable's benchmark lead is arriving at the same moment Anthropic is trying to prove two things at once: that its most capable widely available model can do commercially useful work, and that it can be governed without being pulled from the market. The first claim is measured in performance. The second is measured in access, pricing, false positives, enterprise trust, and government tolerance.

For founders and operators, the practical read is narrower than the social-media framing. Fable's RLI score is evidence that frontier agents are improving on messy project work that looks closer to actual freelance output than most benchmarks. It is not evidence that remote teams can swap out workers wholesale. The benchmark still shows failure on the large majority of projects, and even the best result is partial because access to Fable was cut off before the full run finished.

The strategic read is stronger. RLI gives AI labs, employers, and policymakers a live scoreboard for a class of work that used to be discussed mostly through anecdotes. If Fable can clear roughly one in six real freelance-style projects on a benchmark whose previous leaders were clustered near the floor, the competitive question shifts from whether agents can occasionally complete paid work to which categories become reliably automatable first.

Scale's own leaderboard points to those pockets. Earlier RLI successes concentrated in audio tasks, image generation, report writing, and data retrieval. The broader dataset includes video, CAD, graphic design, game development, audio, architecture, product design, and a long tail of other work. That mix is why the next model gains will matter less as abstract percentages than as category-by-category changes in what customers stop hiring humans to do.

Fable 5 has not crossed that line at scale. But 16.10% on RLI is the first result on this benchmark that makes the floor look less stable.
