Z.ai's GLM-5.2 tops open-weight models on Artificial Analysis work benchmark
The open-weight model scored 1524 Elo on GDPval-AA, putting it near proprietary frontier systems on agentic knowledge-work tasks.
By Ryan Merket
Why it matters
GLM-5.2 gives open-weight AI a stronger claim in agentic knowledge work: not just cheaper access, but benchmark performance near closed frontier models.

GLM-5.2, from Z.ai (@Zai_org), ranked as the leading open-weight model and No. 3 overall on Artificial Analysis' GDPval-AA benchmark, according to a three-post thread that Artificial Analysis (@ArtificialAnlys) published on X on Monday, June 22.
The score is the useful part: Artificial Analysis says GLM-5.2 posted a 1524 Elo on GDPval-AA, a benchmark built to test long-horizon, multi-turn agent work on economically valuable knowledge tasks. That puts an open-weight Chinese model in the same conversation as proprietary frontier systems on a workload category that is closer to how companies are beginning to use AI agents: not chat, but multi-step deliverables.
https://x.com/ArtificialAnlys/status/2069121548670406947
The result follows Artificial Analysis' broader June 17 benchmark note, which ranked GLM-5.2 as the top open-weight model on its Intelligence Index v4.1 with a score of 51. Artificial Analysis placed it ahead of MiniMax-M3 and DeepSeek V4 Pro, both at 44, and Kimi K2.6 at 43. On GDPval-AA v2 specifically, Artificial Analysis reported GLM-5.2 at 1524, ahead of MiniMax-M3 at 1418 and DeepSeek V4 Pro at 1328, and effectively level with GPT-5.5 (xhigh) at 1514.
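For a rough sense of what these Elo gaps mean, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale. The article does not say how Artificial Analysis computes GDPval-AA ratings, so the hypothetical helper `expected_score` and the percentages it prints are illustrative only.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo logistic curve.

    Assumption: GDPval-AA ratings behave like standard Elo ratings with a
    400-point scale. The benchmark's actual rating method may differ.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# GDPval-AA v2 scores as reported by Artificial Analysis.
GLM_5_2 = 1524
peers = {
    "GPT-5.5 (xhigh)": 1514,
    "MiniMax-M3": 1418,
    "DeepSeek V4 Pro": 1328,
}

for name, rating in peers.items():
    share = expected_score(GLM_5_2, rating)
    print(f"GLM-5.2 vs {name}: {share:.1%} expected preference rate")
```

Under that assumption, a 10-point gap works out to roughly a 51% expected preference rate, which is why the GPT-5.5 (xhigh) comparison reads as a near tie rather than a clear win.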
Z.ai is making a familiar open-weight argument with sharper economics: if an enterprise or developer team can get near-frontier agent performance without locking into a closed model provider, the deployment decision becomes less about raw benchmark rank and more about control, cost and operational flexibility. Z.ai's pricing page lists GLM-5.2 at $1.40 per 1 million input tokens, $0.26 per 1 million cached input tokens and $4.40 per 1 million output tokens. Artificial Analysis calculated GLM-5.2 at about $0.46 per Intelligence Index task, higher than several open-weight peers but low for its score band.
The catch is token efficiency. Artificial Analysis said GLM-5.2 used 43,000 output tokens per Intelligence Index task, up from 26,000 for GLM-5.1 and above MiniMax-M3, Kimi K2.6 and DeepSeek V4 Pro. That matters because per-token pricing understates the actual cost of agentic systems when the model has to reason, inspect, revise and produce files over many turns. Cheap tokens can still add up to an expensive task if the model spends enough of them getting there.
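The arithmetic can be sketched from the figures reported here. The prices are the ones listed for GLM-5.2. The helper `task_cost` is hypothetical, and pricing GLM-5.1's 26,000-token count at GLM-5.2's output rate is an assumption made only to isolate the effect of token count. Input-token counts per task are not given, so the example covers output tokens only and will not reproduce Artificial Analysis' $0.46 per-task figure.

```python
# Listed GLM-5.2 prices, in USD per 1 million tokens.
PRICE_INPUT = 1.40
PRICE_CACHED_INPUT = 0.26
PRICE_OUTPUT = 4.40


def task_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task from its token counts."""
    return (
        input_tokens * PRICE_INPUT
        + cached_input_tokens * PRICE_CACHED_INPUT
        + output_tokens * PRICE_OUTPUT
    ) / 1_000_000


# Output-only cost at the two per-task token counts Artificial Analysis reported.
# Assumption: both are priced at GLM-5.2's output rate to isolate token count.
cost_43k = task_cost(0, 0, 43_000)
cost_26k = task_cost(0, 0, 26_000)

print(f"43,000 output tokens: ${cost_43k:.4f}")
print(f"26,000 output tokens: ${cost_26k:.4f}")
print(f"ratio: {cost_43k / cost_26k:.2f}x")
```

At the same output price, the jump from 26,000 to 43,000 tokens raises the output portion of a task's cost by about 65%.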
Z.ai introduced GLM-5.2 in a June 16 research post as a long-horizon model with a 1 million-token context window, an MIT open-source license and an architecture change called IndexShare, which Z.ai says reduces per-token FLOPs at long context lengths. Artificial Analysis lists the model at 744 billion total parameters with 40 billion active parameters, the same size profile as GLM-5.1, with the context window expanded from 200,000 tokens to 1 million.
The GDPval framing matters because the benchmark is not another multiple-choice exam. The original GDPval paper, published in October 2025, described a benchmark for real-world economically valuable tasks across 44 occupations and nine major U.S. GDP sectors, using work products created by experienced professionals. Artificial Analysis' GDPval-AA v2 adapts that line of evaluation for model comparison with a human baseline of 1000 Elo, a rotating panel of frontier-model judges and a higher turn limit for longer agent trajectories.
That design also sets limits on what the score proves. GDPval-style work is still scoped digital knowledge work, not the messy totality of a job. It excludes manual work, tacit organizational judgment, private data access and live collaboration. The benchmark is better read as evidence that GLM-5.2 can produce competitive deliverables under controlled agent conditions, not that it can replace a professional workflow end to end.
Artificial Analysis' other new agent benchmark points in the same direction. In AA-Briefcase, a benchmark built around multi-week knowledge-work projects using thousands of source files, Artificial Analysis says GLM-5.2 (max) is the clear leader among open-weight models, ranking behind Claude Fable 5 and Claude Opus 4.8 (max) but ahead of GPT-5.5 (xhigh). The benchmark combines rubric pass rate, analytical quality and presentation quality, which makes it a tougher test for agent systems than answering a single prompt.
The competitive signal is straightforward: the open-weight frontier is moving from coding demos and leaderboard claims into the professional-work benchmarks that closed labs have used to justify premium pricing. GLM-5.2 does not remove the tradeoffs. It appears less token-efficient than some open-weight peers, and its top-line benchmark rank still comes from Artificial Analysis' evaluation stack rather than broad production evidence. But for Z.ai, the GDPval-AA result gives GLM-5.2 a stronger enterprise-facing claim than the usual open-model pitch. It is not just open. On this benchmark, it is close enough to make closed-model procurement harder to defend without a task-specific test.