Depthfirst turns FFmpeg into a proof point for autonomous security agents

The AI security startup says its agent found 21 FFmpeg zero-days for about $1,000, including an RCE exploit primitive.

By Ryan Merket · Published Jun 13, 2026, 11:38pm CT

Why it matters

Depthfirst is showing that the scarce resource in AI security may no longer be vulnerability discovery alone, but validated triage, responsible disclosure and patch capacity.

An AI agent autonomously dissecting software code to expose vulnerabilities (hand-drawn editorial illustration)

Depthfirst has published one of the clearer tests yet of its core bet: autonomous security agents will be judged not by how many warnings they produce, but by whether they can prove exploitable bugs in code that has already been attacked for decades.

In a research post, depthfirst says its production security agent found 21 zero-day vulnerabilities in FFmpeg, the open-source multimedia framework that sits inside browser, streaming, surveillance, transcoding and media-ingest pipelines. The company says the run cost roughly $1,000, produced concrete proof-of-concept inputs, and found bugs that had survived years of fuzzing, manual audits and recent AI-assisted reviews by Google and Anthropic.

A hard target, chosen on purpose

FFmpeg is a useful proving ground because it is the opposite of a toy benchmark. Depthfirst describes the project as roughly 1.5 million lines of optimized C code covering hundreds of media formats. It also sits in a dangerous part of the stack: media parsers routinely process complex, untrusted files and streams. That makes the code both widely exposed and unusually picked over.

Depthfirst explicitly frames the work against two recent efforts from larger AI labs. The company says Google's Big Sleep team had disclosed 13 FFmpeg vulnerabilities, and it points to Anthropic's Claude Mythos Preview work on FFmpeg.

Depthfirst's counterclaim is not that it had a stronger frontier model. It says it did not have access to Mythos. Its claim is that the system around the model - the agent loops, harnesses, execution checks and vulnerability-specific workflow - changed the economics. In the FFmpeg run, depthfirst says its agent found 21 zero-days for about one-tenth of Anthropic's cited spend.
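To put those economics in concrete terms, here is a back-of-the-envelope sketch in Python. It uses only the approximate figures depthfirst reports (about $1,000, 21 findings, roughly one-tenth of the comparison spend); the variable names are illustrative, and the implied comparison figure is derived from the stated ratio, not taken from Anthropic.

```python
# Approximate figures as reported by depthfirst; treat results as rough estimates.
run_cost_usd = 1_000        # "about $1,000"
findings = 21               # zero-days claimed in the FFmpeg run
comparison_multiple = 10    # "about one-tenth" of the cited comparison spend

# Average cost per reported finding.
cost_per_finding = run_cost_usd / findings

# Comparison spend implied by the one-tenth ratio (an inference, not a quoted figure).
implied_comparison_spend = run_cost_usd * comparison_multiple

print(f"cost per finding: ${cost_per_finding:.2f}")
print(f"implied comparison spend: ${implied_comparison_spend:,}")
```

On those inputs the average works out to under $50 per finding, though both the cost and the ratio are rounded in the original post.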

The disclosure includes a useful discrepancy

Depthfirst's writeup says eight of the issues had been assigned CVEs, then lists nine identifiers: CVE-2026-39210 through CVE-2026-39218. The safest reading is that the post contains a count error while the identifiers themselves are the useful record.
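The count is easy to check. A minimal sketch, assuming the range is contiguous as the word "through" suggests (the helper name `seq` is illustrative):

```python
# Endpoints of the CVE range as listed in depthfirst's post.
first, last = "CVE-2026-39210", "CVE-2026-39218"

def seq(cve_id: str) -> int:
    """Return the numeric sequence part of a CVE identifier."""
    return int(cve_id.rsplit("-", 1)[1])

# Enumerate every identifier in the inclusive range.
ids = [f"CVE-2026-{n}" for n in range(seq(first), seq(last) + 1)]

print(len(ids))
```

An inclusive range from 39210 to 39218 contains nine identifiers, one more than the eight the writeup states.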

Those listed issues cover a TS demuxer heap buffer overflow introduced in 2010, an integer overflow from a 2010 swscale refactor, a stack overflow from a July 2025 regression in ffmpeg_opt.c, a 2023 yuv4mpegenc heap buffer overflow, a stack buffer overflow in the original SDT implementation the post traces to 2003, and several other heap overflows in areas including update_mb_info(), img2enc.c, the VP9 decoder and the DASH demuxer.

The remaining findings are tracked by depthfirst's internal IDs. Examples in the post include bugs in RTP AV1 depacketization, swscale graph code, RTP JPEG depacketization, AVIF overlay parsing, and RTP LATM depacketization.

The age distribution is the part that should make operators uncomfortable. Some issues are recent regressions from 2024 and 2025. Others trace back to 2003, 2010, 2012 and 2017, according to depthfirst. That is the shape of modern software risk: code can be heavily audited and still carry old memory-safety flaws in parser paths that are rarely exercised but remain exposed to untrusted input.

The strongest proof is the AV1 RTP bug

Depthfirst spends the most time on DFVULN-127, a heap buffer overflow in FFmpeg's AV1 RTP depacketizer. The company also published a GitHub proof-of-concept repository titled "FFmpeg AV1 RTP Depacketizer - Heap Overflow to PC Hijack PoC" demonstrating an instruction-pointer hijack.

That is the difference between a model producing a plausible bug report and a security agent producing something a maintainer can prioritize: instead of a static warning, the agent generates a reproducible input that proves reachability and exploitability, letting upstreams triage and fix with evidence in hand.
