Marin debuts Delphi, forecasting a 25B-parameter pretraining run within 0.2 percent error

In an X thread, Will Held says small-run fits predicted a 25B-param, 600B-token run at ~1e23 FLOPs, matching Paloma macro loss within 0.2% error.


Why it matters

Training large open models is expensive and failure-prone. If small, controlled runs can reliably forecast big-run performance, teams can plan compute, choose recipes, and catch regressions before burning months and money.

Illustration: An intricate forecasting mechanism predicting large-scale AI training parameters with high accuracy.

Marin introduced Delphi, a scaling suite meant to make large pretraining runs predictable, with early results that extrapolate from small models to a 25B-parameter run within 0.2 percent error. In a detailed thread on X, Will Held (@WilliamBarrHeld) framed the goal simply: "To train better open models, we need predictable scaling."

What Marin shipped

In a blog post and Held's thread, Marin says it trained a family of small Dyna models ranging from 72M to 6.9B parameters under a single fixed recipe, fit a scaling law, and then extrapolated roughly 300x in compute to forecast a 25B-parameter model trained on 600B tokens. The forecast matched the observed Paloma macro loss with about 0.2% error at around 1e23 FLOPs, which supports the fit across more than two orders of magnitude in compute.
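The ~1e23 FLOPs figure is consistent with the common rule of thumb that training costs about 6 FLOPs per parameter per token. The sketch below is a back-of-the-envelope check, not Marin's own accounting: the function name is hypothetical, and both the 6ND approximation and the reading of "roughly 300x" as a ratio in compute are assumptions.

```python
import math


def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token (assumption)."""
    return 6.0 * params * tokens


# Figures reported in the article: 25B parameters, 600B tokens.
target_flops = approx_train_flops(25e9, 600e9)

# Assumption: "roughly 300x" is a ratio in compute between the
# largest fitted run and the forecast target.
extrapolation_factor = 300.0
orders_of_magnitude = math.log10(extrapolation_factor)
implied_largest_fit_flops = target_flops / extrapolation_factor

print(f"target run:          {target_flops:.1e} FLOPs")
print(f"extrapolation span:  {orders_of_magnitude:.2f} orders of magnitude")
print(f"implied largest fit: {implied_largest_fit_flops:.1e} FLOPs")
```

Under these assumptions the target works out to about 9e22 FLOPs, in line with the reported ~1e23, and a 300x gap spans roughly 2.5 orders of magnitude.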

The post highlights controlled experiments as the core of Delphi: keep the recipe stable, sweep the parameter and token tradeoff, and tune only the key hyperparameters that matter for scale. It also points to its interactive figures, linked from the post, as a way to make the scaling intuition concrete.
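To make the "fit small, forecast big" idea concrete, here is a minimal sketch of fitting a power law to small runs and extrapolating. It is a deliberate simplification and not Delphi's method: it assumes a pure power law in compute, loss = a * C^(-b), with no irreducible-loss term and no separate parameter/token terms. The function names and the data points are hypothetical, and the losses are synthetic, generated from made-up constants.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic "small runs": made-up constants, not Marin's data.
TRUE_A, TRUE_B = 20.0, 0.05
small_compute = [1e18, 1e19, 1e20, 3e20]
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
forecast = predict_loss(a, b, 9e22)  # extrapolate 300x past the largest run
print(f"fitted exponent b = {b:.4f}, forecast loss at 9e22 FLOPs = {forecast:.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants. Real runs are noisy, which is why a stable recipe and a careful sweep matter before trusting an extrapolation.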

How they say it changes model development

David Hall (@dlwh) summed up the workflow shift in a reply on X: "Delphi changes how we evaluate new ideas: start small, sweep the param/token tradeoff, scale the key hypers, compare against forecasts, repeat." In a separate note, Hall urged readers not to skip the visuals, saying the interactive figures help show what transfers and what breaks as you scale.

Several researchers amplified the release, including reposts from Percy Liang (@percyliang), Nando de Freitas (@NandoDF), and Siddharth Karamcheti (@siddkaramcheti). Stella Biderman (@BlancheMinerva) called the work "Really incredible."

What to watch next

If Delphi's fits generalize across recipes and data domains, they give open-model teams a way to de-risk expensive runs: prove the scaling behavior at small sizes, forecast the bigger run, and track deviations as a signal that something in the recipe or data changed. For operators budgeting runs on the order of 1e23 FLOPs, turning unknowns into forecastable curves is the difference between iteration and burn.
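Tracking deviations can be as simple as comparing the observed loss against the forecast and flagging gaps beyond a tolerance. The sketch below is illustrative only: the function names, the 1% default tolerance, and the loss values are hypothetical and not taken from Marin's post.

```python
def relative_error(forecast: float, observed: float) -> float:
    """Absolute gap between forecast and observed loss, relative to observed."""
    return abs(forecast - observed) / observed


def deviates(forecast: float, observed: float, tolerance: float = 0.01) -> bool:
    """True when the run has drifted from its forecast by more than `tolerance`."""
    return relative_error(forecast, observed) > tolerance


# Hypothetical loss values, chosen only to exercise both branches.
on_track = deviates(forecast=2.500, observed=2.505)  # ~0.2% gap
off_track = deviates(forecast=2.500, observed=2.600)  # ~3.8% gap
print(on_track, off_track)
```

A gap near the 0.2% reported for the 25B run would pass such a check, while a multi-percent gap would be a prompt to look for a change in the recipe or data.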

Marin's team is positioning Delphi as a first step toward more reliable open pretraining. The next questions are the ones practitioners care about: how robust the fits are across distributions and architectures, how early in training the forecasts stabilize, and when the param/token balance shifts as data improves. The blog and thread suggest the suite is set up to iterate on exactly those fronts.
