YC Startup Trata Releases Hedge-Bench, a New Benchmark Built From Hedge Fund Analyst Reasoning
The YC W25 company says the benchmark uses 102 tasks drawn from hedge fund analyst reasoning traces.
By Ryan Merket

WHY IT MATTERS
As AI benchmarks become increasingly saturated with coding tasks, math problems, and synthetic evaluations, a growing number of researchers are asking a different question: how should we measure performance on real-world knowledge work?
https://x.com/trytrata/status/2062962521892598174
Trata, a startup from Y Combinator's Winter 2025 batch, believes the answer may lie in the workflows used by professional investors.
This week, the company released Hedge-Bench, a benchmark designed to evaluate open-ended financial reasoning using tasks derived from the work of hedge fund analysts. The benchmark contains 102 evaluation tasks built from analyst reasoning traces and supporting research materials. According to the company, no frontier model scored above 16% on the benchmark.
The release offers a rare look not only at how leading AI systems perform on financial research tasks, but also at the dataset Trata has been building out of public view.
Building Around Investor Knowledge
Founded by Eric Cho and Sean Park, Trata operates at an unusual intersection of expert networks, financial research, and artificial intelligence.
The company's thesis is straightforward. Some of the most valuable information in public markets never appears in SEC filings, earnings calls, investor presentations, or commercial data platforms. Instead, it exists inside the research processes of professional investors who spend weeks developing conviction around individual companies and sectors.
Rather than attempting to replace those investors, Trata has spent the past year building systems that capture, structure, and distribute their insights.
According to company materials, the platform works with more than 125 hedge funds representing over $175 billion in assets under management. Participating firms contribute research and perspectives that are organized into a searchable knowledge network intended to help investors discover insights from peers working on similar companies and themes.
That process appears to have created an asset that extends beyond the company's primary product: a large collection of professional reasoning traces.
Hedge-Bench is the first public example of how Trata intends to use that data.
From Analyst Workflow to Benchmark
Most existing finance benchmarks focus on tasks with objectively verifiable answers.
A model might be asked to retrieve information from a filing, calculate a financial metric, classify a document, or answer questions about a company's reported results. These evaluations are useful because they are relatively easy to score and reproduce.
However, they often measure information retrieval more than judgment.
Hedge-Bench attempts to evaluate a different layer of the workflow.
According to the paper, each benchmark task originates from real analyst work and includes the information sources available to the analyst as well as the reasoning process used to arrive at a conclusion. The benchmark spans a range of research activities that require synthesizing information across multiple sources rather than extracting a single fact.
The goal is not simply to determine whether a model can locate information.
The goal is to evaluate how effectively it can interpret that information, weigh competing signals, and reason through ambiguity.
Those are often the parts of investment research that are hardest to automate.
A Shift Toward Measuring Jobs
The release arrives amid a broader shift in AI evaluation.
For much of the current AI cycle, benchmark development has focused on capabilities. Researchers measured whether models could write code, solve mathematical problems, answer academic questions, or complete browser-based tasks.
Increasingly, however, benchmark creators are moving toward representations of actual occupations.
Software engineering became a focal point with the emergence of SWE-Bench. Browser-based evaluations have evolved to reflect workflows performed by analysts, operators, and administrative workers. Newer benchmarks increasingly measure the completion of end-to-end tasks rather than isolated skills.
Hedge-Bench extends that trend into financial research.
Whether the benchmark ultimately becomes widely adopted remains unclear, but it reflects a growing belief that evaluating real professions may be more informative than evaluating abstract capabilities.
That distinction matters because companies are no longer selling AI as a collection of features. They are increasingly marketing AI systems as analysts, researchers, consultants, associates, and other forms of knowledge workers.
What the Results Suggest
The headline finding from the paper is that current frontier systems struggled on the benchmark.
Claude Sonnet 4.6 achieved the highest reported score at 15.4%, followed by Claude Opus 4.7 at 11.9%. OpenAI's GPT-5.5 scored 9.4%, while Claude Opus 4.8 and Gemini 3.5 Flash scored 8.7% and 8.6%, respectively.
The authors also report high hallucination rates on certain multi-step reasoning tasks and note that newer generations of models did not consistently outperform earlier versions.
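The reported figures can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it hard-codes the five scores quoted in this article (it does not run the benchmark), and the names `reported_scores` and `rank` are hypothetical helpers introduced here, not part of Hedge-Bench.

```python
# Scores (in percent) exactly as quoted in this article.
# Assumption: these are overall benchmark scores; nothing is recomputed here.
reported_scores = {
    "Claude Sonnet 4.6": 15.4,
    "Claude Opus 4.7": 11.9,
    "GPT-5.5": 9.4,
    "Claude Opus 4.8": 8.7,
    "Gemini 3.5 Flash": 8.6,
}


def rank(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank(reported_scores)
best_model, best_score = ranking[0]

# Gap between the best and worst of the five quoted scores, in points.
spread = round(best_score - ranking[-1][1], 1)

# Gap between the older and the newer Opus release, in points.
opus_gap = round(
    reported_scores["Claude Opus 4.7"] - reported_scores["Claude Opus 4.8"], 1
)

for model, score in ranking:
    print(f"{model:<20} {score:>5.1f}%")
print(f"Top score below 16%: {best_score < 16}")
print(f"Spread across quoted models: {spread} points")
print(f"Opus 4.7 minus Opus 4.8: {opus_gap} points")
```

On these numbers, the top score sits under the 16% ceiling the company cites, and the older Opus 4.7 leads the newer Opus 4.8 by 3.2 points, which is the kind of generational inconsistency the authors describe.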
Benchmark results should always be interpreted cautiously, particularly when they originate from organizations with a commercial stake in the problem being measured. Even so, the findings align with a pattern that has emerged across several recent agent evaluations.
Accessing information is becoming increasingly commoditized.
Reasoning over that information remains considerably more difficult.
The Questions Hedge-Bench Raises
The benchmark's most distinctive characteristic may also become its most debated.
Unlike many existing evaluations, Hedge-Bench relies heavily on reasoning traces produced by professional analysts.
That approach creates a more realistic representation of investment work, but it also raises questions about evaluation methodology.
How many valid ways are there to analyze a company?
Should multiple reasoning paths receive equal credit?
How much disagreement exists among experienced investors working from the same information?
And to what extent should benchmark performance be measured against a particular expert process rather than an objective outcome?
Those questions extend well beyond finance.
Researchers building evaluations for law, consulting, scientific research, medicine, and other professional domains face many of the same challenges. The closer benchmarks move toward human judgment, the harder they become to score deterministically.
In that sense, Hedge-Bench may be as interesting for the methodological questions it raises as for the scores it reports.
More Than a Benchmark
For Trata, the release serves as both a research contribution and a glimpse into the company's broader strategy.
Many startups are building systems that consume financial information. Trata appears to be building around something more specific: the reasoning processes of the people who interpret that information for a living.
That distinction may prove important.
Data is increasingly abundant. Financial documents are widely available. Earnings calls are transcribed instantly. Research tools have never been more accessible.
Expert reasoning remains considerably harder to collect.
Whether that becomes a durable competitive advantage for Trata remains an open question. What is clear is that the company has assembled a dataset that few organizations possess, and Hedge-Bench offers one of the first public views into how it may be used.
As AI benchmarks continue their migration from academic exercises toward real-world occupations, Hedge-Bench provides an early example of what that next generation of evaluation may look like.