ViBench aims to rank AI models by app-building, not just coding tests

The public site and ACM paper frame ViBench as an end-to-end test of whether coding agents can deliver usable apps, not just pass SWE-style tasks.


Why it matters

Coding model leaderboards increasingly shape buying decisions for founders and engineering teams. Amjad Masad's ViBench push argues that the relevant test is not just code correctness but whether a model can turn prompts into complete apps cheaply and reliably.

Masad says Opus 4.8 beats GPT 5.5 on price and performance for end-to-end app creation, despite GPT's SWE benchmark lead.

Amjad Masad (@amasad) introduced ViBench, a benchmark he says is designed to measure how well AI models build apps end-to-end, in a post on X on Wednesday.

https://x.com/amasad/status/2062226152790675805

Masad framed the benchmark as a challenge to the way coding models are usually ranked. "Benchmarks place GPT 5.5 as the best model on SWE, but is it the best at making apps end-to-end?" he wrote. His answer, based on ViBench, is no: Masad said Opus 4.8 "continues to be the king of vibe coding" on both price and performance.

The project also has an ACM paper, "ViBench: A Benchmark on Vibe Coding," by Peter Zhong and co-authors, published in CAIS '26. The paper describes tasks derived from production traces across 15 web applications, spanning zero-to-one creation and feature extension, and an automatic evaluator that uses browser automation to test generated apps from an end user's perspective rather than assuming a particular code structure.

In the paper, the authors report evaluating nine models across 105 artifacts. Even the leading models were far from reliable: Opus 4.6 and GPT-5.2 reached 46% and 42% Pass@1, respectively, while no open-weight model exceeded 12% Pass@1. The authors also report 99% step-level agreement between the automatic evaluator and human experts across 1,082 test steps.
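For readers unfamiliar with the two metrics, the Python sketch below shows how figures like these are commonly computed. It is an illustration under stated assumptions, not the paper's scoring code: it treats Pass@1 as the share of attempts that pass, averaged across tasks, and step-level agreement as the share of test steps where the automatic and human verdicts match. The function names and the toy data are hypothetical and do not come from ViBench.

```python
from fractions import Fraction


def pass_at_1(results):
    """Average per-task pass rate.

    `results` maps a task id to a list of booleans, one per independent
    attempt. With a single attempt per task this is simply the fraction
    of tasks that passed. (Assumed definition; the paper may differ.)
    """
    if not results:
        raise ValueError("no tasks to score")
    per_task = []
    for task, attempts in results.items():
        if not attempts:
            raise ValueError(f"task {task!r} has no attempts")
        # True counts as 1 and False as 0, so sum() is the number of passes.
        per_task.append(Fraction(sum(attempts), len(attempts)))
    return float(sum(per_task) / len(per_task))


def step_agreement(auto_verdicts, human_verdicts):
    """Fraction of test steps where both graders reached the same verdict."""
    if len(auto_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not auto_verdicts:
        raise ValueError("no steps to compare")
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(auto_verdicts)


# Hypothetical toy data, not taken from the paper.
toy_results = {
    "todo-app": [True],
    "blog": [False],
    "dashboard": [True],
    "chat": [False],
}
toy_auto = [True, False, True, True]
toy_human = [True, False, False, True]

print(pass_at_1(toy_results))               # 0.5
print(step_agreement(toy_auto, toy_human))  # 0.75
```

On this reading, a 46% Pass@1 means a model's first attempt at a task produced a passing app a little less than half the time.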

The gap between the two kinds of ranking matters because SWE-style software engineering tests and app-building workflows measure different things. A model can perform well on discrete coding tasks while still struggling with the product loop that builders care about: generating an app, wiring the pieces together, iterating from prompts and producing something usable.

Masad called ViBench "the first benchmark for app creation based on real world tasks." The paper makes a narrower claim: ViBench is an open-source benchmark for end-to-end web application development from user-facing requirements, with tasks drawn from anonymized Replit production traces. The broader test will be whether outside evaluators adopt the task set, reproduce the scoring and treat it as a durable alternative to SWE-style leaderboards.

For now, ViBench is best read as Masad's attempt to move the AI coding debate away from abstract leaderboard wins and toward a narrower question: which model can actually help a user ship an app at acceptable cost.
