Hugging Face-led team open-sources Carbon, a fast DNA foundation model with 393k bp context

Three checkpoints (500M, 3B, 8B) ship under Apache 2.0 with open weights, code, and data, trained on ~1T tokens. The team credits a 6-mer tokenizer and a new loss for Carbon-3B matching Evo2-7B’s win rate at a cited ~275x throughput, and says a single GPU can process a human genome in under two days.

By Ryan Merket · Published May 19, 2026, 3:18pm CT

Why it matters

If the speed claims hold, single-GPU whole-genome analysis in under two days could push genomics workloads out of specialized clusters and into standard lab rigs, lowering costs and broadening access.

AI model rapidly processing a DNA helix on a single GPU (Hand-drawn editorial illustration)

Hugging Face and research partners released Carbon, an open-source autoregressive genomic foundation model focused on fast, long-context DNA modeling. The initial drop includes three checkpoints at 500M, 3B, and 8B parameters with open weights, training code, and data under the Apache 2.0 license, trained on roughly 1T tokens of curated genomes. The 3B model card is live at HuggingFaceBio/carbon-3b, alongside a tech report, dataset, and GitHub codebase. The team frames Carbon as open code, open weights, open data.

[ELI5 callout]

DNA is a very long string made of four letters (A, C, G, T). Carbon works like autocomplete for that string: given a stretch of DNA, it predicts what comes next and how likely different changes are.
It reads six letters at a time (a 6-mer tokenizer), which makes it run much faster than reading one letter at a time.
Long context means it can keep track of hundreds of thousands of letters at once, so it can connect far-apart patterns in the genome.
Why it matters: faster, cheaper runs make it easier to scan whole genomes, score the impact of variants, and explore edits. The team says a single GPU can process a whole human genome in under two days.
It is released with open code, open weights, and open data under Apache 2.0, so researchers can inspect and adapt it.

Carbon’s speed claims rest on two choices described in the technical report and demos: a fixed 6-mer tokenizer that compresses sequence length 6x, and a Factorized Nucleotide Supervision objective that grants partial credit to near-miss tokens late in training. The architecture is deliberately vanilla, a Llama-like decoder, so that it inherits vLLM optimizations. The team says this combination yields order-of-magnitude throughput gains: a benchmark plot shows Carbon-3B matching Evo2-7B’s win rate at roughly 275x the throughput, and single-H100 throughput figures cited in the materials include Carbon-500M at ~152k bp/s, Carbon-3B at ~123k bp/s, and Carbon-8B at ~85k bp/s.
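To make the 6x compression concrete, here is a minimal sketch of fixed, non-overlapping 6-mer tokenization. It is an illustration, not Carbon’s actual tokenizer: the vocabulary ordering, the handling of leftover bases, and the names `VOCAB` and `tokenize` are all assumptions.

```python
from itertools import product

K = 6  # k-mer size cited in the article
ALPHABET = "ACGT"

# Hypothetical vocabulary: every possible 6-mer gets an integer id (4**6 = 4096 entries).
VOCAB = {"".join(kmer): idx for idx, kmer in enumerate(product(ALPHABET, repeat=K))}


def tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 6-mers and map each to an id.

    Assumption: trailing bases that do not fill a whole 6-mer are dropped, and
    ambiguous bases such as N raise KeyError. The real tokenizer may differ.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % K
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


dna = "ACGTACGTTAGCCGATAGCTAGCT"  # 24 bases
ids = tokenize(dna)
print(len(dna), "bases ->", len(ids), "tokens")
```

Each token covers six bases, so a decoder with a fixed token budget sees six times more sequence than a single-nucleotide model would, at the cost of a 4,096-entry vocabulary instead of four symbols.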

Context length is extended in two steps beyond the 8k-token (~49 kbp) pretraining window: a training-time phase to 32k tokens (~197 kbp), then YaRN at inference to 64k tokens for the 3B (393,216 bp) and 128k tokens for the 8B (~786 kbp). On a long-context retrieval task (Genome-NIAH), the materials report Carbon-8B reaching 65% exact-match at 786 kbp versus 53% for Evo2-7B at the same length. The public positioning emphasizes commodity hardware: a speedup chart and Leandro von Werra’s post state that a single GPU can process a whole human genome in under two days.
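The base-pair figures follow from multiplying token counts by six, and the cited single-H100 throughput numbers allow a rough sanity check of the whole-genome claim. The sketch below assumes a genome length of about 3.1 billion base pairs and one plain pass over the sequence; neither assumption comes from the release materials, which do not specify the workload behind the two-day figure.

```python
BP_PER_TOKEN = 6  # from the 6-mer tokenizer

# Context stages named in the article, in tokens (k = 1024).
stages = {
    "pretraining": 8 * 1024,
    "training-time extension": 32 * 1024,
    "YaRN, 3B": 64 * 1024,
    "YaRN, 8B": 128 * 1024,
}
context_bp = {name: tokens * BP_PER_TOKEN for name, tokens in stages.items()}
for name, bp in context_bp.items():
    print(f"{name}: {bp:,} bp")

# Single-H100 throughput figures cited in the materials, in bp/s.
throughput = {"Carbon-500M": 152_000, "Carbon-3B": 123_000, "Carbon-8B": 85_000}

GENOME_BP = 3.1e9  # assumption: approximate haploid human genome length

# Hours for one naive pass over the genome at each cited rate.
hours = {name: GENOME_BP / rate / 3600 for name, rate in throughput.items()}
for name, h in hours.items():
    print(f"{name}: ~{h:.1f} h for one pass")
```

Under these assumptions a single pass lands well inside two days for all three checkpoints. Heavier workloads, such as scoring many variants per position or running overlapping windows, would take longer.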

Beyond speed, the team highlights training-free capability evaluations: generative sequence recovery; variant-effect scoring on ClinVar and other sets; sequence perturbation tests (motif insertion, synonymous codon shuffling); and long-context retrieval. The accompanying demos include an Intro DNA Lab and a Carbon Recipe sandbox showing gene autocomplete and confidence tracking across exon-intron structure; per-base variant likelihoods without supervised labels; species-specific continuation given a few hundred bases of context; protein folding from generated coding regions via ESMFold; and embeddings that cluster genes by kingdom and reconstruct a species tree from mean-pooled vectors.
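The article does not spell out how Carbon scores variants. One common training-free approach with autoregressive models is to compare the log-likelihood of the sequence with and without the variant. The sketch below illustrates that idea only: the scorer is a toy k-mer count model standing in for a real language model, and every function name here is hypothetical.

```python
import math
from collections import Counter


def make_toy_logprob(training_seq: str, order: int = 2):
    """Build a stand-in next-base scorer from k-mer counts (NOT Carbon).

    Returns a function giving log P(base | previous `order` bases) with
    add-one smoothing over the four nucleotides.
    """
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(training_seq)):
        ctx = training_seq[i - order:i]
        context_counts[ctx] += 1
        joint_counts[ctx + training_seq[i]] += 1

    def logprob(context: str, base: str) -> float:
        ctx = context[-order:]
        return math.log((joint_counts[ctx + base] + 1) / (context_counts[ctx] + 4))

    return logprob


def sequence_loglik(seq: str, logprob, order: int = 2) -> float:
    """Sum of next-base log-probabilities over the sequence."""
    return sum(logprob(seq[:i], seq[i]) for i in range(order, len(seq)))


def variant_score(ref: str, pos: int, alt: str, logprob) -> float:
    """Log-likelihood ratio of the alternate vs. reference sequence.

    Negative values mean the scorer finds the variant less plausible.
    """
    mutated = ref[:pos] + alt + ref[pos + 1:]
    return sequence_loglik(mutated, logprob) - sequence_loglik(ref, logprob)


reference = "ATGATGATGATGATGATGATG"
scorer = make_toy_logprob(reference * 5)
print(variant_score(reference, 10, "C", scorer))
```

With a real model, `logprob` would be replaced by the model’s own next-token probabilities; no labeled pathogenicity data is needed, which is what makes this kind of evaluation training-free.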

Competitive landscape: Carbon enters a field that is prioritizing longer contexts and higher throughput for DNA language modeling. Many efforts benchmark against Evo2, and this release does the same, citing a Carbon-3B win rate comparable to Evo2-7B at far higher throughput. Across the space, the main levers are tokenization to compress sequences, objectives that better reflect biological uncertainty, and inference stacks tuned for commodity GPUs. Carbon’s choice of a plain decoder to inherit vLLM optimizations, a fixed 6-mer tokenizer, and factorized supervision positions it as a throughput-first, fully open offering within that landscape.

https://x.com/lvwerra/status/2056774820872831234

The release lists Carbon as a joint effort by Hugging Face, the Zhongguancun Academy, TIGEM, and the Università degli Studi di Napoli Federico II. Von Werra announced the release and the throughput result in a thread on X, with a chart and demo pointers in follow-ups. The model card and release materials describe the tokenizer, loss, corpus curation (~1T tokens across ~6T base pairs), and context-extension recipe. The materials also flag data-sensitivity risks: genomic inputs and outputs may be handled differently depending on where inference runs, and users should understand provider data handling before using the model. We will update as new benchmarks and artifacts land across the 500M and 8B variants.
