Miso Labs publishes Miso TTS 8B, an open voice model for conversational speech

Aoden Teo's release puts inference code, public weights and a local quickstart on GitHub, while the repo details voice-cloning support, watermarking and CUDA hardware limits.

By Ryan Merket · Published Jun 3, 2026, 2:26pm CT

Why it matters

Miso TTS 8B shows how quickly expressive voice synthesis is moving from polished demos into downloadable models and APIs, while voice cloning, GPU cost, and hardware requirements remain the practical constraints operators have to evaluate.


Aoden Teo (@AodenTeoMT) released Miso TTS 8B on X on Wednesday; the launch thread also refers to it as Miso One. He said the 8-billion-parameter text-to-speech model is built for highly expressive speech generation rather than flat narration.

https://x.com/AodenTeoMT/status/2062204362102100295

Teo called the model "the most emotive voice model in the world," a claim RuntimeWire cannot independently verify from the launch thread alone. He also said it responds faster than a human, citing a 110-millisecond figure in the post, and said every voiceover in the thread was generated by the model.

The release is being distributed through the MisoTTS GitHub repo, with public weights hosted on Hugging Face and a demo on misolabs.ai. The repo describes Miso TTS 8B as a text-to-dialogue RVQ Transformer inspired by Sesame's CSM architecture, using a Llama 3.2-style 8B backbone, a smaller 300M autoregressive audio decoder, Mimi audio codes, 32 audio codebooks and a 2,048-token maximum sequence length.
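To give a rough sense of what a 2,048-token window could mean for clip length, here is a back-of-envelope sketch in Python. It rests on two assumptions that the repo description quoted above does not state: that Mimi emits audio frames at its published 12.5 Hz rate, and that, as in CSM-style models, each audio frame takes one position in the backbone sequence alongside the text tokens. The function names are illustrative and are not part of the MisoTTS code.

```python
# Back-of-envelope sequence budget; not taken from the MisoTTS repo.
MAX_SEQ_LEN = 2048          # from the repo description
NUM_CODEBOOKS = 32          # from the repo description
MIMI_FRAME_RATE_HZ = 12.5   # ASSUMPTION: Mimi's published frame rate


def max_audio_seconds(text_tokens: int = 0,
                      max_seq_len: int = MAX_SEQ_LEN,
                      frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Seconds of audio that fit if each frame uses one sequence position.

    ASSUMPTION: text tokens and audio frames share the same window.
    """
    if not 0 <= text_tokens <= max_seq_len:
        raise ValueError("text_tokens must be between 0 and max_seq_len")
    audio_frames = max_seq_len - text_tokens
    return audio_frames / frame_rate_hz


def codes_per_second(num_codebooks: int = NUM_CODEBOOKS,
                     frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Discrete audio codes produced per second across all codebooks."""
    return num_codebooks * frame_rate_hz


if __name__ == "__main__":
    print(f"Upper bound with no text: {max_audio_seconds(0):.2f} s")
    print(f"With 200 text tokens:     {max_audio_seconds(200):.2f} s")
    print(f"Codes per second:         {codes_per_second():.0f}")
```

Under those assumptions the ceiling works out to a little under three minutes of audio per window, with less available once the transcript and any prompt audio are counted. The real limit depends on how the released code packs its sequences.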

For developers, the repo includes local inference instructions using uv or pip. The default example downloads MisoLabs/MisoTTS into the Hugging Face cache, runs run_misotts.py and writes full_conversation.wav in the repository root. The Python API can generate speech from text alone, and the prompted-generation example shows how prior audio and its transcript can be passed as context for voice cloning.

In replies, Teo said Miso TTS 8B can clone a voice from about 10 seconds of audio, supports only English for now and does not include a voice changer feature. The repo separately warns that the speech model should not be used to impersonate people, create deceptive audio, commit fraud or generate harmful content, and says generated audio is watermarked by default through a SilentCipher watermarking model.

The launch also disclosed some deployment tradeoffs. Teo told one user the landing page demo does not support arbitrarily long generation because of GPU costs, while the open model and API can produce longer audio. He said a system with about 32 GB of VRAM should be able to generate roughly a 2.5-minute clip, and described the current build as CUDA bf16 without proper quantization yet. The repo likewise says Miso TTS 8B is a large model and recommends a CUDA GPU with sufficient VRAM for the checkpoint precision being loaded.
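For operators sizing hardware, the weights-only part of that requirement can be estimated with simple arithmetic. The sketch below assumes the nominal parameter counts given in the repo description (8B backbone, 300M decoder) and 2 bytes per parameter for bf16. It deliberately ignores activations, the attention cache, the Mimi codec and the watermarking model, so it is a lower bound rather than a measured figure, and the helper name is hypothetical.

```python
# Weights-only memory estimate; a lower bound, not a measurement.
BACKBONE_PARAMS = 8_000_000_000   # nominal "8B" from the repo description
DECODER_PARAMS = 300_000_000      # nominal "300M" audio decoder
BYTES_PER_PARAM_BF16 = 2          # bf16 stores each value in 16 bits

GIB = 1024 ** 3


def weight_gib(num_params: int, bytes_per_param: float) -> float:
    """Memory needed to hold the parameters alone, in GiB."""
    if num_params < 0 or bytes_per_param <= 0:
        raise ValueError("num_params must be >= 0 and bytes_per_param > 0")
    return num_params * bytes_per_param / GIB


if __name__ == "__main__":
    total_params = BACKBONE_PARAMS + DECODER_PARAMS
    bf16 = weight_gib(total_params, BYTES_PER_PARAM_BF16)
    # Hypothetical 8-bit case, shown only to illustrate what quantization
    # could change; the article says proper quantization is not available yet.
    int8 = weight_gib(total_params, 1)
    print(f"bf16 weights:           {bf16:.1f} GiB")
    print(f"8-bit weights (if any): {int8:.1f} GiB")
```

On these numbers the bf16 weights alone take up roughly half of a 32 GB card, which is consistent with Teo's remark that longer clips need that class of GPU until a quantized build exists.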
