UniPat AI team debuts SaaS-Bench as agents finish under 4% of real SaaS workflows

Built across 23 live SaaS apps and 106 long-horizon tasks, the open benchmark finds frontier agents stumble on planning, memory, cross-app context, and error recovery.

By Ryan Merket · Published May 18, 2026, 4:38pm CT

Why it matters

The benchmark moves the agent conversation from short demos to real enterprise workflows. If frontier systems cannot reliably complete more than a sliver of long-horizon tasks, product teams need to invest in planning, memory, and error recovery before promising hands-off automation. A shared, open yardstick also helps separate real progress from hype.

[Illustration: an AI agent struggling to complete a complex, multi-application digital workflow]

UniPat AI researchers, working with collaborators from Peking University, the University of Hong Kong, 0G Labs, and Pipeline Lab, released SaaS-Bench, a new evaluation that puts computer-use agents inside real SaaS products. In a paper on arXiv, the team reports that even the strongest model they tested completed fewer than 4% of 106 end-to-end workflows.

What they built

SaaS-Bench is grounded in deployable, production SaaS tools rather than toy websites or single-page apps. The suite spans 23 systems across six professional domains and asks agents to execute realistic, long-horizon workflows that often require more than 100 interaction steps. Tasks cover both text-only and multimodal settings and force agents to coordinate across apps, maintain context over time, and handle dynamic UI and state changes.

To score progress, the authors use weighted verification checkpoints that capture strict completion as well as partial credit. That lets the benchmark distinguish between an agent that stalls at step two and one that gets most of the way there but fumbles the last mile. The code and task definitions are open at UniPat-AI/SaaS-Bench.
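To make the scoring idea concrete, here is a minimal sketch of how weighted checkpoints could yield both a strict pass/fail result and a partial-credit score. This is an illustration only, not the benchmark's implementation: the `Checkpoint` class, the function names, the example weights, and the choice of a normalized weighted sum are all assumptions, since the exact aggregation used by SaaS-Bench is not described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    """One verification checkpoint in a task (hypothetical structure)."""
    name: str      # illustrative label, not taken from SaaS-Bench
    weight: float  # relative importance of this checkpoint (assumed)
    passed: bool   # whether the agent's run satisfied it


def partial_credit(checkpoints: Sequence[Checkpoint]) -> float:
    """Weighted fraction of checkpoints passed, in the range [0, 1].

    Assumes a normalized weighted sum; the paper may aggregate differently.
    """
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoints must have positive total weight")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def strict_completion(checkpoints: Sequence[Checkpoint]) -> bool:
    """True only if there is at least one checkpoint and all of them passed."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Two hypothetical runs of the same four-checkpoint task.
stalled_early = [
    Checkpoint("open_app", 1.0, True),
    Checkpoint("create_record", 2.0, False),
    Checkpoint("sync_to_second_app", 3.0, False),
    Checkpoint("send_summary", 4.0, False),
]
fumbled_last_mile = [
    Checkpoint("open_app", 1.0, True),
    Checkpoint("create_record", 2.0, True),
    Checkpoint("sync_to_second_app", 3.0, True),
    Checkpoint("send_summary", 4.0, False),
]

for label, run in [("stalled early", stalled_early),
                   ("fumbled last mile", fumbled_last_mile)]:
    print(label, round(partial_credit(run), 2), strict_completion(run))
```

Under these assumed weights, both runs fail strict completion, but the partial-credit score separates them: the early stall earns 1 of 10 weight units and the near miss earns 6 of 10.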

What the results say

Across the seven frontier models evaluated, performance dropped sharply in these realistic settings. The paper reports that the best system completed fewer than 4% of tasks end-to-end, and it highlights recurring failure modes: brittle planning, weak state tracking and memory, poor cross-application context maintenance, and limited error recovery when something goes off-plan.

Those findings underscore a growing gap between marketing demos and day-to-day enterprise work. In controlled benchmarks, agents can pass short, isolated tasks. In live SaaS, they must handle login flows, changing layouts, app-to-app dependencies, and incomplete instructions while keeping a coherent plan across long workflows, often more than 100 steps.

Why they built it

The UniPat AI team argues that existing web and GUI agent benchmarks overestimate capability because they simplify page logic, remove real backend constraints, or restrict tasks to single apps with short horizons. SaaS-Bench aims to reflect the messy reality of modern knowledge work, aligning with UniPat AI's stated focus on training AI in realistic scenarios that produce productive, trustworthy, generalizable systems.

The bet

By anchoring evaluation in real workflows, the authors are betting the field will focus on the hard problems that actually block deployment: planning over long horizons, memory and state, robust grounding to UIs, and recovery from errors. For founders and teams building agents into SaaS, SaaS-Bench offers a common yardstick and a place to prove progress. The project is open, so others can reproduce results or add tasks that mirror their own stacks.