Kimi K2.7 Code ranks second, behind Claude Fable-5-max and ahead of GPT-5.5 xhigh, in ErdosBench's mathematical research test

Przemek Chojecki's 14-problem smoke run puts Moonshot's new open-weight model behind Claude Fable-5-max and ahead of GPT-5.5 xhigh.

By Ryan Merket · Published Jun 14, 2026, 12:22am CT

Why it matters

Kimi K2.7 Code's ErdosBench result suggests open-weight models are pushing beyond cheap coding help into proof-aware research workflows, where hygiene matters as much as raw answers.

A stylized representation of an AI model's code reaching the upper tiers of a complex benchmark system

Przemek Chojecki (@prz_chojecki) put Kimi K2.7 Code in the top tier of his ErdosBench public smoke test on June 13, ranking Moonshot AI's new coding model second behind Claude Fable-5-max and ahead of GPT-5.5 xhigh.

Chojecki, the ulam.ai founder and mathematician who has been turning synthetic Erdos-style research problems into a model evaluation track, said the rerun covered 14 public problems and compared Kimi K2.7 Code with Qwen 3.7 Max, Grok 4.3 and the leading models from earlier runs. His judgment was not that Kimi won the set. It was sharper: Kimi was the strongest new entrant, with enough mathematical creativity to move the benchmark's reference status on one problem.

The result matters because Kimi K2.7 Code was released on June 13 as an open-weight, coding-focused agentic model, not as a pure math product. Moonshot says K2.7 Code is built for long-horizon software engineering, uses a mixture-of-experts architecture with 1 trillion total parameters and 32 billion activated parameters per token, supports a 256K context window, and reduces thinking-token use by about 30% versus K2.6. Its Hugging Face model card lists a modified MIT license and describes the model as supporting text, image and video inputs.
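As a back-of-the-envelope reading of those figures, the short sketch below works out what share of the model's parameters is active for any one token and what a 30% cut in thinking tokens would mean in practice. The 10,000-token K2.6 trace is a made-up number chosen for illustration, not a Moonshot figure.

```python
# Figures as reported by Moonshot for Kimi K2.7 Code.
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32 billion activated per token
THINKING_TOKEN_REDUCTION = 0.30    # "about 30%" versus K2.6

# Share of the model's weights that participate in any single token.
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

# Hypothetical K2.6 thinking-token count, used only to make the 30% concrete.
k26_thinking_tokens = 10_000
k27_thinking_tokens = round(k26_thinking_tokens * (1 - THINKING_TOKEN_REDUCTION))

print(f"Active share per token: {active_share:.1%}")
print(f"Thinking tokens: {k26_thinking_tokens:,} on K2.6 -> {k27_thinking_tokens:,} on K2.7")
```

Under these numbers, roughly 3.2% of the parameters are active per token, which is why a trillion-parameter mixture-of-experts model can be served at a cost closer to that of a much smaller dense model.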

ErdosBench is a different kind of test from the coding and agentic benchmarks Moonshot foregrounds in its launch materials. The public GitHub repository describes ErdosBench as a research-mathematics benchmark for whether AI systems can behave like useful mathematical research assistants: finding obstructions, applying known theorems, checking proof gaps, running finite experiments and avoiding overclaimed novelty. The public repository intentionally exposes only 14 smoke-test statements, not the full 226-problem corpus, private splits, answer keys or verifier internals.

That release boundary is central to how the Kimi result should be read. The public smoke set is reproducible and useful, but it is not the private leaderboard. The repository's own validator checks that all 14 expected problem numbers are present and that result rows have the required fields; it does not judge mathematical correctness. The ranking Chojecki posted is therefore an audited public-smoke comparison, not a blanket claim that Kimi is the best math model.
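To make the distinction concrete, here is a minimal sketch of what a structural validator of this kind might look like. It is not the repository's code: the field names (`problem`, `model`, `status`), the row layout and the assumption that the public problems are numbered 1 through 14 are all hypothetical.

```python
# Hypothetical structural check in the spirit of the one described above.
# Field names and problem numbering are assumptions, not ErdosBench's schema.
REQUIRED_FIELDS = ("problem", "model", "status")
EXPECTED_PROBLEMS = set(range(1, 15))  # assumes public problems are numbered 1-14


def validate_results(rows, expected_problems=EXPECTED_PROBLEMS,
                     required_fields=REQUIRED_FIELDS):
    """Return a list of structural errors; an empty list means well-formed.

    Nothing here inspects the mathematics: a row with a wrong proof
    passes as long as its fields are filled in.
    """
    errors = []
    seen = set()
    for index, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            errors.append(f"row {index}: missing fields {missing}")
        problem = row.get("problem")
        if problem is not None:
            seen.add(problem)
    absent = sorted(set(expected_problems) - seen)
    if absent:
        errors.append(f"missing problems: {absent}")
    return errors


# A complete, well-formed file passes even if every claim in it is wrong.
complete = [{"problem": n, "model": "example-model", "status": "solved"}
            for n in range(1, 15)]
print(validate_results(complete))        # no structural errors
print(validate_results(complete[:-1]))   # problem 14 is absent
```

A check like this guards against incomplete or malformed submissions. Whether a "solved" status is deserved still has to be judged by a person or by a separate verifier.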

Within that narrower frame, the scorecard was still notable. Chojecki said Kimi K2.7 Code covered 13 of the 14 problems, produced accepted solved or settled results on Problems 1, 3, 4, 5 and 7, and had no rejected solved claims. He singled out Problem 3 as the standout: Kimi used a Green-Tao prime-arithmetic-progression block construction that upgraded the reference status to "superpolynomial but subexponential." That is the kind of result a research benchmark is supposed to reward because it changes the benchmark artifact rather than merely matching an expected answer.

Claude Fable-5-max remained first in Chojecki's comparison because it matched Kimi on the five accepted solved or settled core results while covering all 14 problems and producing broader accepted partial progress. GPT-5.5 xhigh through Codex remained his cleanest elite baseline, with 14 of 14 coverage, accepted solved or settled results on Problems 1, 4, 5 and 7, no rejected solved claims, and stronger proof hygiene around partial work. In other words, Kimi's run was not the broadest or cleanest. It was the most consequential new run.

Qwen 3.7 Max also performed strongly in Chojecki's rerun, covering all 14 problems and receiving accepted solved or settled results on Problems 1, 4, 5 and 7. But Chojecki marked it down for overstatement on several problems where proof gaps remained material. Grok 4.3 was not competitive on this smoke set, according to the same comparison: it covered 12 of 14 problems, had no accepted core solved or settled results, and had a solved claim on Problem 5 rejected.
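The counts reported above can be gathered into one small table. The sketch below does that and sorts by accepted results, then coverage. That sort key is an illustrative assumption, not Chojecki's scoring rule, which also weighs proof hygiene and partial progress. Where the article gives no count of rejected claims, the value is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional

TOTAL_PROBLEMS = 14


@dataclass(frozen=True)
class SmokeRun:
    model: str
    covered: int             # problems covered, out of 14
    accepted: int            # accepted solved-or-settled results
    rejected: Optional[int]  # rejected solved claims; None if not reported


RUNS = [
    SmokeRun("Claude Fable-5-max", 14, 5, None),
    SmokeRun("Kimi K2.7 Code", 13, 5, 0),
    SmokeRun("GPT-5.5 xhigh", 14, 4, 0),
    SmokeRun("Qwen 3.7 Max", 14, 4, None),
    SmokeRun("Grok 4.3", 12, 0, 1),  # the rejected solved claim on Problem 5
]


def naive_key(run):
    """Illustrative ordering only: accepted results first, then coverage."""
    return (run.accepted, run.covered)


# sorted() is stable, so runs with equal keys keep their input order.
ranked = sorted(RUNS, key=naive_key, reverse=True)
for run in ranked:
    rejected = "n/a" if run.rejected is None else run.rejected
    print(f"{run.model:<20} coverage {run.covered}/{TOTAL_PROBLEMS}  "
          f"accepted {run.accepted}  rejected {rejected}")
```

On these two counts alone, GPT-5.5 xhigh and Qwen 3.7 Max tie. The distinction Chojecki draws between them comes from proof hygiene, which raw counts do not capture.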

The market signal is that open-weight Chinese labs are no longer just fighting the price-performance story in coding. They are encroaching on the proof-hygiene layer that frontier labs have tried to make their moat. RuntimeWire's previous Kimi K2.6 head-to-head with Ministral 3B framed Moonshot's gains around execution; this ErdosBench run shifts the comparison toward proof hygiene. Moonshot's own launch table still shows Kimi K2.7 Code trailing GPT-5.5 and Claude Opus 4.8 on several coding and agentic benchmarks. Kimi scored 62.0 on Kimi Code Bench v2 versus 69.0 for GPT-5.5 and 67.4 for Claude Opus 4.8, and 53.6 on Program Bench versus 69.1 and 63.8, respectively. On MCP Mark Verified, Moonshot reports Kimi K2.7 Code at 81.1, ahead of Claude Opus 4.8's 76.4 but still behind GPT-5.5's 92.9.

The ErdosBench run cuts through a familiar problem in model launches: company-owned benchmark tables tend to reward the behavior a lab optimized for. Chojecki's smoke set rewards a different trait, the ability to make mathematically useful progress without laundering guesses into proofs. That is why the absence of rejected solved claims matters as much as the number of accepted results. A model that can say less, cleanly, is often more useful to a researcher than a model that says more and forces a human to unwind bad proof structure.

Kimi K2.7 Code is still a fresh release, and the evidence base is narrow. Fourteen public problems are enough to catch failure modes and signal capability; they are not enough to settle a model hierarchy. But the direction is clear: Moonshot's new open-weight model did not merely look good on Moonshot's launch chart. In an outside math smoke test designed to punish overclaiming, it landed among the frontier systems and produced one of the run's benchmark-moving contributions.
