Infini-AI-Lab says Vortex hits 3.46x throughput with agent-generated attention

The research framework lets agents write attention flows in Python, compile them into serving kernels, and benchmark end-to-end LLM throughput.

By Ryan Merket · Published Jun 7, 2026, 2:26am CT

Why it matters

Sparse attention is moving from hand-tuned systems work into programmable search. If Vortex's benchmark claims hold up outside the lab, agents could shrink the cycle between a new idea and a serving-speed measurement from weeks of kernel work to minutes.

Infini-AI-Lab at Carnegie Mellon University has released Vortex, a research framework that lets AI agents design sparse-attention algorithms, compile them into fused kernels, and test them inside a real LLM serving stack.

The work, credited to Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, and Beidi Chen, targets a bottleneck that matters as reasoning and agent workloads stretch generation lengths: moving the KV cache during decoding. Vortex gives agents a Python-embedded frontend, vFlow, over a page-centric tensor abstraction, vTensor, then plugs the result into SGLang-compatible serving machinery.
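
To make the idea concrete, here is a minimal sketch of page-level sparse attention for a single decode step. It is not Vortex's vFlow or vTensor API, which this article does not document. The function names, the mean-key page-scoring heuristic, and the toy data are all assumptions chosen for illustration. The point it shows is that attending to only a few pages of the KV cache means fewer keys and values have to be read per generated token.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def page_sparse_attend(query, keys, values, page_size, top_pages):
    """Attend only to the highest-scoring pages of the KV cache.

    Hypothetical heuristic: score each page by the dot product of the
    query with the page's mean key. This is an illustrative choice,
    not the method used by Vortex.
    """
    n = len(keys)
    pages = [range(s, min(s + page_size, n)) for s in range(0, n, page_size)]

    def page_score(page):
        mean_key = [sum(keys[i][d] for i in page) / len(page)
                    for d in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(pages, key=page_score, reverse=True)[:top_pages]
    idx = sorted(i for page in chosen for i in page)
    out = attend(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx

# Toy cache: 8 tokens, 2-dimensional keys, 1-dimensional values.
query = [1.0, 0.0]
keys = [[0.1 * i, 0.0] for i in range(8)]
values = [[float(i)] for i in range(8)]

sparse_out, attended = page_sparse_attend(query, keys, values,
                                          page_size=2, top_pages=2)
print(attended)  # [4, 5, 6, 7]: half of the cache is read
```

In this toy setup the sparse path touches four of eight cached tokens. Whether the output stays close enough to full attention depends on the selection rule, which is the kind of trade-off the framework is meant to let agents explore.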

The lab says the best agent-generated sparse-attention variant reached 3.46x the throughput of full attention while preserving accuracy, running Qwen3-1.7B on the AIME24 benchmark on Nvidia H200 hardware. Its project page also reports 4.7x throughput on GLM-4.7-Flash, 1.63x on Qwen3-30B-A3B MoE, and 1.37x on the 229B-parameter MiniMax-M2.7 under tensor parallelism on four B200 GPUs. Those are lab-reported benchmark results, not externally audited production measurements.
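
For readers who think in wall-clock time rather than multipliers, the short calculation below converts each reported figure into the fraction of baseline time needed to generate the same number of tokens. It assumes a fixed workload and that throughput is the only thing that changes, which the reported benchmarks may not guarantee. The multipliers are the lab's own numbers. The helper name `time_fraction` is made up for this example.

```python
# Throughput multipliers relative to full attention, as reported by the lab.
reported_speedups = {
    "Qwen3-1.7B": 3.46,
    "GLM-4.7-Flash": 4.7,
    "Qwen3-30B-A3B MoE": 1.63,
    "MiniMax-M2.7": 1.37,
}

def time_fraction(speedup):
    """Fraction of baseline time needed for the same token count.

    Assumes a fixed workload: time is inversely proportional to throughput.
    """
    return 1.0 / speedup

for model, speedup in reported_speedups.items():
    frac = time_fraction(speedup)
    print(f"{model}: {speedup}x -> {frac:.1%} of baseline time "
          f"({1 - frac:.1%} less)")
```

Under that assumption, 3.46x throughput works out to roughly 29% of the baseline time, while 1.37x works out to roughly 73%.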

The more important claim concerns the workflow, not any single speedup. The lab says an agent ran an 18-hour loop across 23 iterations and 92 submissions, proposing and benchmarking sparse-attention variants without a human in the inner loop. The code is available on GitHub, and the accompanying paper frames Vortex as infrastructure for turning model-serving optimization into an agent-search problem.
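
The shape of such a loop can be sketched in a few lines. Everything in the example below is hypothetical: `propose_variant` stands in for the agent, `benchmark` returns synthetic numbers rather than real measurements, and the split of four submissions per iteration is an assumption (the lab reports only the totals, which average out to four). A real agent would also use earlier results to guide later proposals, where this sketch samples at random.

```python
import random

def propose_variant(rng):
    """Hypothetical stand-in for an agent proposing a sparse-attention config."""
    return {"page_size": rng.choice([16, 32, 64]),
            "top_pages": rng.choice([4, 8, 16])}

def benchmark(variant):
    """Hypothetical stand-in for an end-to-end serving benchmark.

    Returns (relative_throughput, accuracy_ok). The numbers are synthetic:
    fewer attended tokens means higher throughput, and accuracy is treated
    as preserved only if at least 256 tokens are attended.
    """
    attended = variant["page_size"] * variant["top_pages"]
    return 1024 / attended, attended >= 256

def search(iterations=23, per_iteration=4, seed=0):
    """Propose, benchmark, and keep the fastest accuracy-preserving variant."""
    rng = random.Random(seed)
    best = None
    submissions = 0
    for _ in range(iterations):
        for _ in range(per_iteration):
            variant = propose_variant(rng)
            throughput, accuracy_ok = benchmark(variant)
            submissions += 1
            # Reject variants that trade accuracy for speed.
            if accuracy_ok and (best is None or throughput > best[0]):
                best = (throughput, variant)
    return best, submissions

best, submissions = search()
print(submissions)  # 92
print(best)
```

The accuracy gate is the important design choice: without it, the loop would simply converge on the variant that reads the least data.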

© 2026 RuntimeWire, Inc. All rights reserved. · Gradient Noise, Inc.
An independent startup and technology publication based in Austin, Texas and San Francisco, California. Send tips to tips@runtimewire.com.