Kexin Huang introduces BiomniBench to grade AI agents on how they think, not just what they answer
In an X post amplified by Peter Jansen, Huang calls BiomniBench the first benchmark to score the process behind an agent’s output, not only the final result.
By Ryan Merket
Why it matters
Today, agentic AI is judged mostly on final outputs, which hides brittle or risky behavior in the intermediate steps. A process-focused benchmark would give builders a clearer signal of reliability and help push the field toward traceable, auditable agents.

Kexin Huang (@KexinHuang5) has announced BiomniBench, a new benchmark for AI agents that evaluates their problem-solving process rather than just their final answers. Huang shared the news in a post on X, writing: "Introducing BiomniBench - the first benchmark focused on evaluating the process, not just the final answer, of AI agents..." The post was amplified by Peter Jansen (@peterjansen_ai).
Why a process-first benchmark
Over the last year, agentic systems have shifted from single-shot responses to multi-step workflows: planning, tool use, reflection, and self-correction. Most benchmarks still compress all of that into a single metric at the end. If an agent guesses right after a faulty chain of steps, it gets full credit; if it makes one small misstep on the way to a mostly correct solution, it often gets zero. A process-aware benchmark aims to reward the quality of the trajectory an agent takes, not only the destination it reaches.
That matters for reliability and safety. Teams deploying agents in production need to know whether an agent follows instructions, selects tools appropriately, flags uncertainty, and recovers from errors. Those signals live in the trace, not the final string. A benchmark that inspects traces could help distinguish fast-but-flaky agents from slower, more dependable ones, and make it harder for models to game evaluations with lucky outputs.
What we know (and do not yet)
Huang’s post frames BiomniBench as centered on process evaluation. Beyond that, the announcement we saw does not include task domains, scoring criteria, a paper, or a code release. It is also unclear whether BiomniBench will evaluate open traces only, require instrumented runs, or define a common schema for logs and actions. We will update when the team shares docs, a leaderboard, or a repository.
The broader push to measure traces
Across the ecosystem, researchers and builders have been exploring trajectory-aware evaluation: scoring intermediate plans, tool calls, and revisions; penalizing hallucinated tool outputs; and crediting correct recovery after error detection. The promise is a fairer signal for agents that must interoperate with software, data, and humans.
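To make the idea concrete, here is a minimal Python sketch of what a trajectory-aware score could look like next to an outcome-only score. It is purely illustrative: the announcement does not describe BiomniBench's scoring, so the `Step` fields, the weights (plus 1 for an accepted step, minus 1 for a hallucinated tool output, plus 0.5 for a recovery), and both example traces are assumptions made up for this example, not anything taken from Huang's benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One entry in a hypothetical agent trace (schema assumed for illustration)."""
    kind: str                   # e.g. "plan", "tool_call", "revision"
    ok: bool                    # whether an assumed judge accepted this step
    hallucinated: bool = False  # tool output was invented rather than observed
    recovers: bool = False      # step detects and fixes an earlier error


def outcome_score(final_correct: bool) -> float:
    """Outcome-only grading: all or nothing on the final answer."""
    return 1.0 if final_correct else 0.0


def process_score(steps: List[Step]) -> float:
    """Average per-step credit, clamped to the range [0, 1].

    The weights are arbitrary assumptions chosen to mirror the ideas in the
    text: reward sound steps, penalize hallucinated tool outputs, and give
    extra credit for recovering from an error.
    """
    if not steps:
        return 0.0
    total = 0.0
    for step in steps:
        if step.hallucinated:
            total -= 1.0
        elif step.ok:
            total += 1.0
        if step.recovers:
            total += 0.5
    return max(0.0, min(1.0, total / len(steps)))


# A "lucky" agent: flawed steps, but the final answer happens to be right.
lucky = [
    Step("plan", ok=False),
    Step("tool_call", ok=False, hallucinated=True),
    Step("tool_call", ok=False),
]

# A "careful" agent: one failed call, then a recovery, but a wrong final answer.
careful = [
    Step("plan", ok=True),
    Step("tool_call", ok=True),
    Step("tool_call", ok=False),
    Step("revision", ok=True, recovers=True),
]

print("lucky:  ", outcome_score(True), process_score(lucky))
print("careful:", outcome_score(False), process_score(careful))
```

Under these made-up weights, the lucky agent earns full outcome credit but no process credit, while the careful agent earns no outcome credit but a high process score. A real benchmark would need a far richer rubric and a reliable judge for each step.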
A benchmark like BiomniBench could catalyze convergence on shared trace formats and judging rubrics, making it easier to compare agents across frameworks and domains. It also raises hard questions: how to normalize across different toolchains, how to prevent overfitting to a judging style, and how to audit privacy when traces include real data. If Huang’s effort delivers transparent tasks and reproducible scoring, it will give teams a sturdier yardstick for whether their agents are getting better at the work between the prompts.
The bottom line
Huang is putting a stake in the ground for process-centric evaluation. If BiomniBench lands with clear tasks and an open scoring methodology, it could become a reference point for anyone who ships agents and needs proof that they think well, not just answer well.