Infini-AI-Lab says Vortex hits 3.46x throughput with agent-generated attention
The research framework lets agents write attention flows in Python, compile them into serving kernels, and benchmark end-to-end LLM throughput.
By Ryan Merket · Published
Why it matters
Sparse attention is moving from hand-tuned systems work into programmable search. If Vortex's benchmark claims hold up outside the lab, agents could shrink the cycle from idea to serving-speed measurement, cutting weeks of kernel work down to minutes.

Infini-AI-Lab at Carnegie Mellon University has released Vortex, a research framework that lets AI agents design sparse-attention algorithms, compile them into fused kernels, and test them inside a real LLM serving stack.
The work, credited to Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, and Beidi Chen, targets a bottleneck that matters as reasoning and agent workloads stretch generation lengths: moving the KV cache during decoding. Vortex gives agents a Python-embedded frontend, vFlow, over a page-centric tensor abstraction, vTensor, then plugs the result into SGLang-compatible serving machinery.
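To make the bottleneck concrete, here is a minimal sketch of page-level sparse attention for a single decode step, written in plain Python. It is not Vortex's vFlow or vTensor API, which this article does not show. The function names, the mean-key scoring rule, and the toy data are all assumptions chosen for illustration. The idea it shows is the general one: score each page of the KV cache cheaply, then read only the highest-scoring pages instead of the whole cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one decode step over the given KV rows."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def page_sparse_attend(query, keys, values, page_size, top_pages):
    """Hypothetical selection rule: rank pages by query . mean(page keys),
    then attend only to rows in the top-ranked pages."""
    n_pages = math.ceil(len(keys) / page_size)
    scored = []
    for p in range(n_pages):
        page_keys = keys[p * page_size:(p + 1) * page_size]
        mean_key = [sum(col) / len(page_keys) for col in zip(*page_keys)]
        scored.append((dot(query, mean_key), p))
    chosen = sorted(p for _, p in sorted(scored, reverse=True)[:top_pages])
    rows = [i for p in chosen
            for i in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    out = attend(query, [keys[i] for i in rows], [values[i] for i in rows])
    return out, chosen, len(rows)

# Toy data (made up): 16 cached tokens, head dimension 4, pages of 4 tokens.
keys = [[float((i * 7 + d * 3) % 5) - 2.0 for d in range(4)] for i in range(16)]
values = [[float((i + d) % 3) for d in range(4)] for i in range(16)]
query = [0.5, -1.0, 0.25, 1.0]

full_out = attend(query, keys, values)
sparse_out, sparse_pages, sparse_rows = page_sparse_attend(query, keys, values, 4, 2)
all_out, all_pages, all_rows = page_sparse_attend(query, keys, values, 4, 4)
```

With two of four pages selected, the sparse call reads half the cached rows. Selecting every page reproduces full attention, which is a useful sanity check for any selection rule. Real systems fuse these steps into GPU kernels, and how well a given rule preserves accuracy is the part that has to be measured.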
The lab says the best agent-generated sparse-attention variant reached 3.46x the throughput of full attention on Qwen3-1.7B on AIME24 using Nvidia H200 hardware while preserving accuracy. Its project page also reports 4.7x throughput on GLM-4.7-Flash, 1.63x on Qwen3-30B-A3B MoE, and 1.37x on the 229B-parameter MiniMax-M2.7 under tensor parallelism on four B200 GPUs. Those are lab-reported benchmark results, not externally audited production measurements.
The more important claim is about workflow, not a single speedup. The lab says an agent ran an 18-hour loop across 23 iterations and 92 submissions, proposing and benchmarking sparse-attention variants without a human in the inner loop. The code is available on GitHub, and the accompanying paper frames Vortex as infrastructure for turning model-serving optimization into an agent-search problem.
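A back-of-envelope calculation shows what those loop figures imply. It assumes submissions ran one after another and were spread evenly across the run, which the lab's numbers do not confirm, so the results are rough averages rather than reported measurements.

```python
# Figures reported by the lab for the agent loop.
hours = 18
iterations = 23
submissions = 92

# Assumption: submissions were evaluated sequentially and evenly spaced.
submissions_per_iteration = submissions / iterations
minutes_per_submission = hours * 60 / submissions

print(f"{submissions_per_iteration:.0f} submissions per iteration")
print(f"about {minutes_per_submission:.1f} minutes per submission")
```

Under those assumptions, each variant went from proposal to an end-to-end throughput number in roughly 12 minutes on average.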