Kilo Code AI says MiniMax M3 matched Claude Opus 4.8 on a code audit for $0.07

The open-source AI coding assistant company, founded by Scott Breitenother and Sytse Sijbrandij, says its self-run test found 13 of 17 planted bugs with MiniMax M3, tying the cheapest Claude Opus 4.8 run, which it priced at $1.30.

By Ryan Merket · Published Jun 6, 2026, 2:06pm CT

Why it matters

Kilo's test underscores a growing buyer problem in AI tooling: the strongest model is not always the most economical model for every software task. If cheaper frontier-adjacent models can catch the same class of high-severity bugs at scale, developer-tool startups can compete on routing, evaluation, and workflow rather than simply reselling the most expensive model endpoint.


Kilo Code AI (@kilocode), the open-source AI coding assistant company founded by Scott Breitenother and Sytse Sijbrandij, said Saturday in a nine-post thread on X that MiniMax M3 matched Anthropic's cheapest Claude Opus 4.8 run on a code-audit test at a fraction of the cost, finding 13 of 17 planted bugs for $0.07.

The company provides an AI coding agent that runs inside development environments including Visual Studio Code and JetBrains IDEs. Its platform can generate code from natural-language instructions, automate repetitive tasks, help with debugging, navigate large codebases with semantic search and memory context, and support more than 400 AI models through either user-selected providers or its own pay-as-you-go gateway.

Kilo Code AI is the source of the benchmark, not a neutral testing lab. In the thread, the company acted as test designer, fixture author, and scorer, framing the comparison around whether lower-cost models can handle repetitive code-review work.

The test fixture was a webhook delivery service built with TypeScript, Bun, and SQLite, according to Kilo Code AI. The company said it catalogued the 17 bugs in advance, then gave each model the same prompt: audit the code for security, reliability, correctness, and coverage; write a report; and do not edit the code.

The company said every run caught the major failures: missing authentication on routes, unsafe outbound requests, a non-constant-time signature check, duplicate webhook delivery risk, and missing idempotency. MiniMax M3 also caught several code-path bugs, including an endpoint returning a stored secret and a replay path accepting deliveries in the wrong state. Claude Opus 4.8 at xhigh and max found 15 of 17, the best raw score in the test.
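One of those findings is easy to illustrate. A non-constant-time signature check usually means the expected and received signatures are compared with `===`, which may return sooner when an early character differs and can leak timing information. The sketch below is not Kilo Code AI's fixture code, which the thread does not reproduce; the function names, the hex-encoded HMAC-SHA256 scheme, and the sample secret are assumptions for illustration. It relies on `timingSafeEqual` from the `node:crypto` module.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: hex-encoded HMAC-SHA256 of the raw request body.
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// The pattern an audit would flag: plain string equality is not
// guaranteed to take the same time for every mismatching input.
function verifyUnsafe(secret: string, body: string, signature: string): boolean {
  return signPayload(secret, body) === signature;
}

// Constant-time alternative: compare the decoded bytes with timingSafeEqual.
function verifySafe(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so reject those inputs first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Both functions return the same answers; they differ only in how the comparison is carried out, which is why this class of bug tends to surface in an audit rather than in functional tests.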

The cost curve was less clean. Kilo Code AI said Claude max cost $3.39, 67% more than xhigh on slightly fewer tokens, without improving the finding count. The company also noted that the headline tie between MiniMax M3 and the cheapest Claude run hid different miss patterns: MiniMax M3 caught a secret-returning endpoint that Claude medium missed, while Claude medium caught an async callback inside a sync transaction that MiniMax M3 missed.
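The reported figures can be turned into a rough cost-per-finding comparison. The sketch below uses only numbers quoted from the thread. Two things in it are assumptions: the xhigh price is not stated, so it is back-derived from the "67% more" claim and is only an estimate, and the $1.30 run is treated as the medium run that tied at 13 findings. The type and function names are illustrative.

```typescript
// Figures as reported by Kilo Code AI; this is a vendor-run test.
interface Run {
  model: string;
  costUsd: number;
  bugsFound: number;
}

const BUGS_PLANTED = 17;

const runs: Run[] = [
  { model: "MiniMax M3", costUsd: 0.07, bugsFound: 13 },
  // Assumption: the cheapest Claude run is the medium run described in the thread.
  { model: "Claude Opus 4.8 (cheapest run)", costUsd: 1.3, bugsFound: 13 },
  { model: "Claude Opus 4.8 max", costUsd: 3.39, bugsFound: 15 },
];

// Dollars spent per planted bug the run actually reported.
function costPerFinding(run: Run): number {
  return run.costUsd / run.bugsFound;
}

// If max cost `pctMore` percent more than xhigh, then xhigh = max / (1 + pctMore / 100).
function impliedXhighCost(maxCostUsd: number, pctMore: number): number {
  return maxCostUsd / (1 + pctMore / 100);
}

for (const run of runs) {
  const perFinding = costPerFinding(run).toFixed(4);
  console.log(`${run.model}: ${run.bugsFound}/${BUGS_PLANTED} found, $${perFinding} per finding`);
}
console.log(`Implied xhigh cost: ~$${impliedXhighCost(3.39, 67).toFixed(2)}`);
```

On these inputs, MiniMax M3 works out to roughly half a cent per finding, the cheapest Claude run to about 10 cents, and Claude max to about 23 cents, with an implied xhigh price near $2.03. Cost per finding ignores which bugs were missed, so it complements the miss-pattern comparison rather than replacing it.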

The result is a vendor-run benchmark, not an independent leaderboard. Its sharper point is narrower: for repetitive code review work, model routing may matter as much as model selection.
