SGLang adds DFlash to push Qwen 3.5 397B-A17B inference up to 4.3x faster
Z Lab, Modal and LMSYS released a DFlash drafter for Qwen's 397B model and benchmarked it above native MTP on 8x B200 GPUs.
By Ryan Merket
Why it matters
The release shifts the open-model fight from model access to serving economics: faster verified decoding can lower the cost of running frontier-scale open weights.

Jian Chen (@jianchen1799), Yesheng Liang and Zhijian Liu (@zhijianliu_) have brought DFlash, Z Lab's block-diffusion speculative decoding method, to SGLang in a collaboration with Modal and LMSYS. The teams report up to 4.31x the throughput of a non-speculative baseline on Qwen 3.5 397B-A17B.
The June 15 release is not another model launch dressed up as infrastructure. It is a serving-stack bet: if open-weight frontier models keep getting larger, the bottleneck shifts from access to the model to the cost of serving it at usable latency. DFlash is aimed directly at that constraint, using a small draft model to propose token blocks that the target model can verify instead of forcing the target model to generate every token one at a time.
The teams published the technical write-up on the LMSYS blog and released the same Qwen 3.5 397B-A17B DFlash drafter across three Hugging Face organizations: z-lab/Qwen3.5-397B-A17B-DFlash, modal-labs/Qwen3.5-397B-A17B-DFlash and an LMSYS-hosted copy linked from the write-up. The Hugging Face model card lists an Apache-2.0 license and includes benchmark configuration, reproduction scripts, runtime patches and raw benchmark outputs under its benchmark directory.
The headline result comes with a narrow but important benchmark frame. On HumanEval at concurrency 1, running Qwen 3.5 397B-A17B in bfloat16 on 8 Nvidia B200 GPUs with SGLang, tensor parallel size 8, greedy decoding, thinking enabled and a maximum output length of 4096 tokens, DFlash with block size 16 reached 874.6 output tokens per second, compared with 202.9 for the baseline. That is the reported 4.31x speedup. Under the same conditions, Qwen's native MTP path reached 557.9 output tokens per second with 7 steps and 480.5 with 15 steps.
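The ratios behind those figures can be rechecked with a few lines of arithmetic. This is only a sketch over the four throughput numbers quoted above; the dictionary keys and the `speedup` helper are names chosen here for illustration, not anything from the benchmark scripts.

```python
# Output tokens per second as quoted in the article
# (HumanEval, concurrency 1, 8x B200). Key names are illustrative.
reported_tps = {
    "baseline": 202.9,
    "dflash_block16": 874.6,
    "mtp_7_steps": 557.9,
    "mtp_15_steps": 480.5,
}


def speedup(tps: float, baseline_tps: float) -> float:
    """Throughput of a configuration relative to the baseline."""
    return tps / baseline_tps


speedups = {
    name: speedup(tps, reported_tps["baseline"])
    for name, tps in reported_tps.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")

# DFlash relative to the faster of the two MTP settings.
dflash_vs_mtp7 = reported_tps["dflash_block16"] / reported_tps["mtp_7_steps"]
print(f"dflash_block16 vs mtp_7_steps: {dflash_vs_mtp7:.2f}x")
```

On these numbers, the two MTP settings work out to roughly 2.75x and 2.37x over the baseline, and DFlash lands at about 1.57x the 7-step MTP result.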
At higher serving load, the gap narrows but does not disappear in the published benchmark table. At concurrency 32 on HumanEval, the baseline reached 2452.7 output tokens per second, while DFlash reached 6666.0 with block size 8 and 6783.7 with block size 16, speedups of 2.72x and 2.77x, respectively. The model card says throughput was measured as generated output tokens divided by wall-clock benchmark time, including prefill and scheduling, with five independent runs per configuration across GSM8K, MATH500, HumanEval, MBPP and MT-Bench.
That definition matters. AI serving claims often rest on idealized decode-only measurements or omit scheduler overhead. Here, the teams are explicitly putting SGLang's runtime overhead into the denominator. That makes the result more useful to operators, although it is still bounded by a high-end 8x B200 setup, Qwen-specific drafter weights and the workloads the teams chose to publish.
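To see why the denominator matters, here is a minimal sketch comparing a decode-only figure with an end-to-end one. Every timing below is a made-up placeholder, not a measurement from the benchmark; only the formula (output tokens divided by wall-clock time, including prefill and scheduling) follows the model card's description.

```python
def throughput(output_tokens: int, seconds: float) -> float:
    """Generated output tokens divided by elapsed time."""
    return output_tokens / seconds


# Hypothetical numbers for a single request, chosen only for illustration.
output_tokens = 4096
decode_s = 4.0       # time spent in decode steps
prefill_s = 0.5      # time spent processing the prompt
scheduling_s = 0.5   # host-side scheduling and bookkeeping

decode_only = throughput(output_tokens, decode_s)
end_to_end = throughput(output_tokens, decode_s + prefill_s + scheduling_s)

print(f"decode-only: {decode_only:.1f} tok/s")
print(f"end-to-end:  {end_to_end:.1f} tok/s")
```

With the same token count, the end-to-end figure is always the lower of the two, which is why a speedup reported on that basis is the more conservative claim.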
The technical bet: draft blocks, then verify
Conventional speculative decoding uses a smaller draft model to propose future tokens, then has the larger target model verify those tokens in parallel. Under greedy decoding, the target accepts the longest prefix of the draft that matches what it would have generated itself, so the server can advance multiple tokens in one cycle. At the first mismatch, the target substitutes its own token and the rest of the draft is discarded. The speedup depends on two things: how many proposed tokens the target accepts and how cheap the drafter is to run.
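The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is not SGLang or DFlash code: `target_next` and `draft_next` are hypothetical functions that return the next token for a context, decoding is greedy, and the per-token calls to the target inside `speculative_step` stand in for what a real server would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify cycle; returns the tokens to append."""
    # Conventional drafter: propose k tokens one at a time.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # Verification (a single batched target pass in a real system).
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target corrects the first mismatch
            return accepted
        accepted.append(tok)
    # Whole draft accepted: the target pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], n_tokens: int, k: int) -> Tuple[List[int], int]:
    """Generate n_tokens after the prompt; also return the cycle count."""
    out = list(prompt)
    cycles = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        cycles += 1
    return out[:len(prompt) + n_tokens], cycles


def greedy(target_next: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


# Toy models: the target counts upward mod 10; the drafter is wrong after a 2.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


spec_out, spec_cycles = generate(toy_target, toy_draft, [0], n_tokens=12, k=4)
plain_out = greedy(toy_target, [0], n_tokens=12)
print(spec_out == plain_out, spec_cycles)
```

The property worth noticing is that the speculative output is identical to plain greedy decoding even though the toy drafter is sometimes wrong; drafter errors cost cycles, not correctness.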
DFlash changes the drafter side of that loop. Z Lab's DFlash GitHub repository describes it as a lightweight block diffusion model for speculative decoding that generates token blocks in parallel. The LMSYS write-up says DFlash combines block diffusion drafting with KV injection, where hidden representations from the target model are injected into the draft model's KV cache so the drafter is conditioned on the target model's context throughout generation.
That design is why the release is more than a benchmark flex. Native MTP, EAGLE-style drafters and other speculative paths can still carry sequential work inside the draft model. DFlash is trying to remove that inner autoregressive bottleneck by proposing a full block in a single forward pass, then letting the target model verify. The quality claim rests on the verifier loop: the target model still accepts or rejects proposed tokens.
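A back-of-the-envelope cost model shows why removing sequential drafter passes matters. It is a simplification under stated assumptions, not a model of DFlash itself: verifying a block is assumed to cost about one target forward pass, the acceptance count is assumed equal for both drafters, and all timings are hypothetical units.

```python
def speedup_over_baseline(mean_accepted: float, draft_time: float,
                          verify_time: float, target_step_time: float) -> float:
    """Estimated speedup of speculative decoding over one-token-per-pass decoding.

    Each cycle emits the accepted draft tokens plus one token from the
    target's verification pass, and costs draft_time + verify_time.
    """
    tokens_per_cycle = mean_accepted + 1
    cycle_time = draft_time + verify_time
    return tokens_per_cycle * target_step_time / cycle_time


# Hypothetical values, in units of one target forward pass.
target_step_time = 1.0
verify_time = 1.0        # assumption: verifying a block costs ~one target pass
draft_pass_time = 0.05   # assumption: one drafter pass is 5% of a target pass
block_size = 16
mean_accepted = 5.0      # assumption: same acceptance for both drafters

# Sequential drafter: one drafter pass per proposed token.
sequential = speedup_over_baseline(
    mean_accepted, block_size * draft_pass_time, verify_time, target_step_time)

# Block drafter: the whole block comes from a single drafter pass.
block = speedup_over_baseline(
    mean_accepted, draft_pass_time, verify_time, target_step_time)

print(f"sequential drafter: {sequential:.2f}x")
print(f"block drafter:      {block:.2f}x")
```

In this toy setting the block drafter wins purely because its drafting cost no longer grows with block size; in practice acceptance rates differ between drafters, which is why the published benchmarks, not a model like this, are the evidence.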
Spec V2 is the other half of the gain
The SGLang integration is also a scheduler story. The LMSYS write-up says DFlash was first added to SGLang's older speculative decoding engine, then moved into the new Spec V2 engine with overlap scheduling. SGLang's documentation describes Speculative Decoding V2 as an implementation that enables an overlap scheduler and V2 speculative workers.
The point is to hide CPU-side work that can otherwise erase GPU-side gains. In the release write-up, the teams say Spec V2 overlaps host-side cleanup from one batch with GPU work on the next batch, and overlaps host KV allocation for one batch with GPU work on the previous batch. In one cited SGLang benchmark for Qwen 3-8B on a single B200 at concurrency 32, that raised performance from about 11.4 thousand tokens per second to about 15.3 thousand tokens per second, an improvement of roughly 34%.
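The effect of overlapping host work with GPU work can be illustrated with a toy two-stage pipeline. This is not SGLang's scheduler: the functions and per-batch timings are hypothetical, and the model simply assumes that the CPU work for one batch can run while the GPU processes a neighbouring batch.

```python
def serial_time(n_batches: int, cpu_s: float, gpu_s: float) -> float:
    """CPU and GPU work alternate; the GPU idles during host work."""
    return n_batches * (cpu_s + gpu_s)


def overlapped_time(n_batches: int, cpu_s: float, gpu_s: float) -> float:
    """Two-stage pipeline: host work for one batch hides behind GPU work
    on another, so the steady-state cost per batch is the slower stage."""
    return cpu_s + (n_batches - 1) * max(cpu_s, gpu_s) + gpu_s


# Hypothetical per-batch costs in milliseconds.
n_batches, cpu_ms, gpu_ms = 100, 0.3, 1.0

t_serial = serial_time(n_batches, cpu_ms, gpu_ms)
t_overlap = overlapped_time(n_batches, cpu_ms, gpu_ms)
print(f"toy model gain: {t_serial / t_overlap:.2f}x")

# The gain quoted in the article, from the two reported throughputs
# (in thousands of tokens per second).
reported_gain = 15.3 / 11.4 - 1
print(f"reported gain: {reported_gain:.1%}")
```

The toy model only hides host time that is shorter than the GPU step; once CPU work per batch exceeds GPU work, the host becomes the bottleneck and overlap alone stops helping.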
That is the practical reason Modal engineers David Wang (@_dcw02) and Charles Frye (@charles_irl) are named alongside the Z Lab and SGLang contributors in the acknowledgement section. DFlash needed model-side research, but it also needed serving-engine work: a draft model architecture inside SGLang, KV-cache integration between draft and target, and a scheduler path that does not hand the gains back to CPU coordination.
The release also shows how open model infrastructure is becoming a coordination problem. Z Lab produced the DFlash method and drafter weights. SGLang provides the serving runtime. Modal contributed infrastructure and engine integration work. LMSYS published the joint technical account. Hugging Face hosts the artifacts. None of those pieces alone makes Qwen 3.5 397B-A17B cheaper to serve; the gain appears when model weights, draft weights, kernels, scheduler and deployment surface line up.
What remains unproven
The published numbers are strong, but they are not universal. They are reported by the collaborating teams, not an independent benchmark lab. They use specific hardware, backends and workloads. They also depend on having a trained DFlash drafter for the target model; speculative decoding is only useful when the drafter is fast enough and accurate enough for the target model to accept a meaningful number of tokens.
The DFlash repository lists draft models for a broader set of targets, including Qwen, Kimi, MiniMax, Gemma and Llama variants, and says support for additional models can be requested through GitHub issues. But the new 4.31x claim should be read as a Qwen 3.5 397B-A17B result under the disclosed benchmark conditions, not a blanket speedup for every model or workload.
Still, the direction is clear. The AI infrastructure market has spent the past two years treating GPU supply as the hard ceiling. DFlash and Spec V2 attack the other ceiling: wasted serial work during inference. For open-weight models, that distinction matters. A model that is technically downloadable but uneconomic to serve is not really open in the operational sense. The teams behind this release are betting that the next phase of open intelligence is not only about releasing weights, but about making those weights cheap enough to run.