Mistral opens Leanstral 1.5 in a bet on machine-checked AI

Arthur Mensch's Paris lab says the Lean 4 model solves 587 PutnamBench problems and found five previously unreported bugs.

By Ryan Merket · Published Jul 3, 2026, 11:11pm CT

Why it matters

Formal verification is becoming a competitive front in AI because it offers something chatbots cannot: machine-checkable evidence that code and math claims hold up.

Abstract representation of machine-checked AI rigorously solving complex mathematical proofs and identifying errors. (Scratchboard illustration, white scratches and dense crosshatching on a deep, inky blue background with subtle gold and gr

Mistral AI released Leanstral 1.5 on July 2, putting a free, Apache-2.0 Lean 4 proof model into the hands of developers at a moment when formal verification is moving from research demos into product strategy.

For co-founders Arthur Mensch, Guillaume Lample and Timothee Lacroix, Leanstral is a clear extension of the company they started in April 2023: a European frontier AI lab built around open models, enterprise deployment and technical independence. Mistral's about page frames the company around openness and control of the stack; Leanstral 1.5 applies that stance to theorem proving and code verification.

Mistral says Leanstral 1.5 has 119 billion total parameters and 6 billion active parameters. The Hugging Face page confirms the Apache-2.0 license and open availability.

The company is pitching Leanstral 1.5 as a practical proof-engineering system rather than a math-chat model. The release describes a three-stage training process: mid-training, supervised fine-tuning and reinforcement learning with CISPO. In one reinforcement-learning setup, the model receives a theorem statement, proposes a Lean proof, gets compiler feedback and retries until it proves or exhausts the task. In another, it works more like a coding agent inside a filesystem, editing files, running bash commands and using the Lean language server to inspect errors, goals and type information.

That agent framing matters. Mistral already sells Vibe for code and Studio as part of a broader enterprise stack. Leanstral 1.5 gives Mistral a research-heavy wedge into the part of coding where normal language-model autocomplete is weakest: proving that a program does what it claims. Fewer vibes, more invariants.

The benchmark claims are strong, and mostly company-reported

Mistral says Leanstral 1.5 reaches 100% on both validation and test sets for miniF2F, solves 587 of 672 PutnamBench problems at a 4 million token budget, and posts state-of-the-art results on FATE-H and FATE-X. The FATE wording needs care: Mistral's summary calls the results 87% on FATE-H and 34% on FATE-X, while the body says Leanstral solved 87 and 34 problems, respectively. Without the denominators in the article, the percentage-versus-count distinction should not be blurred.

The PutnamBench cost comparison is the most commercially relevant claim. Mistral says Leanstral costs about $4 per problem, compared with an estimated $300 or more for Seed-Prover 1.5 high, which it says runs at a budget of 10 H20-days per problem. Mistral also says Aleph Prover costs $54 to $68 per problem, though it notes higher-ranked systems may operate under different conditions, including natural-language proof guidance.

Leanstral's test-time scaling numbers are the center of the release. Mistral says Pass@8 on PutnamBench climbed from 44 solved problems at 50,000 tokens to 244 at 200,000 tokens, 493 at 1 million tokens and 587 at 4 million tokens. That is a familiar frontier-model pattern recast for proof work: throw more inference budget at the problem, let the model revise across long contexts, and convert compute into verified completions.

Mistral also open-sourced FLTEval, an evaluation harness for Lean tasks based on real pull requests from the Fermat's Last Theorem repository. The company says Leanstral 1.5 lifts pass@1 on FLTEval from 21.9 to 28.9 and pass@8 from 31.9 to 43.2, surpassing Opus 4.6's 39.6 at one-seventh the cost. Those are Mistral's figures; independent replication is not in the launch post.

The code-verification section is the sharper product clue

The most useful part of the release is not the leaderboard. It is Mistral's bug-finding pipeline.

Mistral says it tested 57 repositories using a workflow in which Aeneas translates Rust code to Lean, Leanstral infers intended correctness properties, then attempts to prove each property four times. If those attempts fail, it tries four times to prove the negation. The company says the process flagged 47 violated properties, of which 11 pointed to genuine bugs and five were previously unreported on GitHub.

One example is a bug the team says it found in the datrs/varinteger library. That is the kind of issue formal methods advocates have promised to catch for years, and it is more compelling than another contest benchmark because it maps directly to how developers ship broken software.

Mistral says the model's long-horizon behavior also showed up in an AVL-tree proof that ran for more than 2.7 million tokens across 22 compactions. The model established what Mistral calls an almost tight bound of 48 steps per height unit plus a constant for insertion, then connected height to tree size through a logarithmic relationship. Again, this is company-reported, but the direction is important: the model is being measured on sustained proof maintenance, not single-shot answer generation.

Mistral is entering a field where verification is the product

Leanstral 1.5 arrives after formal reasoning became a visible AI frontier. Google DeepMind's AlphaProof combined a pre-trained language model with AlphaZero-style reinforcement learning and Lean, solving four of six problems at the 2024 International Mathematical Olympiad for a silver-medal-level score. DeepSeek's DeepSeek-Prover-V2 paper described an open-source Lean 4 prover trained through reinforcement learning for subgoal decomposition. Harmonic, the Palo Alto math-AI company behind Aristotle, announced a $120 million Series C at a $1.45 billion post-money valuation and says Aristotle uses Lean 4 formal verification.

Mistral's difference is distribution. Leanstral 1.5 is on Hugging Face, has an Apache-2.0 license, and is available through a free API. That fits Mensch's broader strategy around control of the full AI stack: chips, energy, data centers, models and applications. Leanstral pushes that sovereignty argument into a technical niche where open access may matter more than marketing scale, because proof tools improve when researchers and developers can run, inspect and stress them.

The caution is evaluation. A June 2026 paper, Faults in Our Formal Benchmarking, argues that Lean theorem-proving benchmarks can still contain dataset defects and evaluation failures even when proofs are machine-checked, because a kernel verifies a formal statement rather than the informal intent behind it. The paper reports 4,833 findings across five widely used Lean theorem-proving benchmarks and their forks, including 398 mechanically certified issues such as counterexamples, vacuous theorems and unsound axioms.

That does not undercut Leanstral 1.5's release. It sets the bar for what comes next. Mistral has opened the weights and the FLTEval harness. The meaningful test is whether outside users can reproduce the benchmark gains, inspect the failure cases, and find bugs that maintainers accept as real. If they can, Mensch's company will have turned a research model into a proof assistant that earns its compute bill in production code.

Why it matters

The benchmark claims are strong, and mostly company-reported

The code-verification section is the sharper product clue

Mistral is entering a field where verification is the product

Reader comments