Cartesia packages Sonic-3.5 and Ink-2 into a full voice-agent stack
Karan Goel is using benchmark wins to pitch Cartesia as both the speaking and listening layer for real-time AI agents.
By Ryan Merket · Published
Why it matters
Cartesia is using benchmark wins to argue that voice agents should standardize on one low-latency speech stack, not stitched-together TTS and STT vendors.

Karan Goel (@krandiash) is turning Cartesia's May model releases into a broader platform push: Sonic-3.5 for text-to-speech, Ink-2 for speech-to-text, and a claim that Cartesia can now own both sides of the live voice-agent loop.
"We're now the only provider to have #1 models for both speaking and listening," Goel wrote in a post on X on Monday, June 15.
That framing is aggressive, and it needs the timestamp attached. Cartesia's own documentation lists the stable Sonic 3.5 snapshot as released on May 4, 2026, and Ink 2 as released on May 22, 2026. Monday's news is not that the models first appeared. It is that Goel is now bundling them as Cartesia's answer to the practical bottleneck in voice agents: the agent must hear accurately, know when the human has finished speaking, reason, and start speaking back without the awkward pauses and interruptions that make most phone agents feel mechanical.
That is the right battleground for Cartesia. Goel and Albert Gu (@_albertgu) came out of Stanford AI Lab with a founding team that Cartesia says met as PhDs and invented State Space Models, or SSMs, as a foundation-model primitive meant to be more efficient than transformer architectures for long, streaming sequences. Cartesia's own company page says the team spent four years building SSM theory across text, audio, video, images, and time-series data before productizing the work around voice. The important part for buyers is not the acronym. It is whether that architecture lets Cartesia shave latency from every handoff in a live conversation.
The model pages show where Cartesia is pressing the advantage. Sonic 3.5 is Cartesia's current text-to-speech model, with 42-language support, sub-90ms latency, and features aimed at the annoying details of production calls: alphanumeric readouts, emails, phone numbers, confirmation codes, and context-dependent English pronunciations such as heteronyms. Cartesia says the model is ranked No. 1 for naturalness, while Artificial Analysis listed Sonic 3.5 with a 1203.89 Elo score in its Speech Arena crawl last week.
Ink 2 is the more strategically important release because it attacks a less glamorous but more damaging failure mode: endpointing. In voice agents, the mistake is often not what the model says, but when it says it. Cartesia's docs say Ink 2 has native turn detection and emits turn lifecycle events including turn.start, turn.update, turn.eager_end, turn.resume, and turn.end, giving an agent a direct signal for when to keep listening and when to respond. That reduces the need to bolt on a separate voice activity detection layer, which is where many live-agent stacks add latency or cut users off mid-sentence.
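For developers, the practical question is how an agent would act on those events. The following is a minimal Python sketch, not Cartesia's SDK: only the five event names come from the docs as described above. The dictionary payload shape, the "text" field, the TurnTracker class, and the meaning assigned to each event (for example, treating turn.eager_end as a provisional end that turn.resume can cancel) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Event names as listed in the article. The payload shape and the
# semantics assigned to each event below are illustrative assumptions.
TURN_EVENTS = {
    "turn.start",
    "turn.update",
    "turn.eager_end",
    "turn.resume",
    "turn.end",
}


@dataclass
class TurnTracker:
    """Hypothetical helper that maps turn events to agent actions."""

    state: str = "idle"  # one of: idle, listening, maybe_done
    transcript: str = ""
    actions: list = field(default_factory=list)

    def handle(self, event: dict) -> None:
        kind = event["type"]
        if kind not in TURN_EVENTS:
            raise ValueError(f"unknown event type: {kind}")

        if kind == "turn.start":
            # Caller began speaking: reset the transcript and listen.
            self.state = "listening"
            self.transcript = ""
        elif kind == "turn.update":
            # Assumed: updates carry the latest transcript in "text".
            self.transcript = event.get("text", self.transcript)
        elif kind == "turn.eager_end":
            # Assumed: a provisional end, so start preparing a reply early.
            self.state = "maybe_done"
            self.actions.append("prepare_reply")
        elif kind == "turn.resume":
            # Assumed: the caller kept talking, so drop the prepared reply.
            self.state = "listening"
            self.actions.append("cancel_reply")
        elif kind == "turn.end":
            # The turn is final: the agent may speak.
            self.state = "idle"
            self.actions.append("speak_reply")
```

Under those assumptions, a caller who pauses mid-sentence and then continues would produce a prepare_reply, a cancel_reply, a second prepare_reply, and finally a speak_reply, which is the behavior a separate voice activity detection layer would otherwise have to approximate.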
Independent benchmarking supports part of Goel's listening claim. Artificial Analysis' June 2 streaming STT benchmark reported that Cartesia Ink-2 with semantic endpoints had the highest final-after-end-of-speech accuracy in its test, at 3.59% word error rate and 0.21 seconds of latency. ElevenLabs Scribe v2 Realtime followed at 3.64% WER and 0.14 seconds, while Cartesia Ink-2 with external endpoints posted 3.66% WER and 0.09 seconds. The benchmark used roughly eight hours of audio across AA-AgentTalk, VoxPopuli, and Earnings22, with AA-AgentTalk weighted at 50% of the streaming index.
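For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below computes that textbook definition in Python. It is not Artificial Analysis' scoring code: the function name is hypothetical, and the whitespace tokenization is an assumption that skips the text normalization (casing, punctuation, number formatting) a real benchmark would apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level edit distance divided by reference length.

    Assumption: words are split on whitespace with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

By that definition, a 3.59% WER means roughly 36 word-level errors per 1,000 reference words, which is why gaps of a few hundredths of a point between vendors are narrow.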
The caveat is important: "No single model leads everywhere," Artificial Analysis wrote in the same benchmark note. ElevenLabs led on AA-AgentTalk, AssemblyAI and Google led on parts of VoxPopuli, and Cartesia led on Earnings22. Goel's headline claim is therefore a leaderboard snapshot and category framing, not a universal law of voice AI. It also combines different benchmarks: TTS naturalness is judged through a preference-style arena, while STT performance is measured through word error rate and latency.
Still, Cartesia's move is well timed. Voice agents have shifted from demo novelty to production workflow in customer support, scheduling, healthcare intake, recruiting, and sales qualification. In those settings, a model that sounds good but waits too long is unusable. A transcription model that is accurate but cannot tell whether the caller has finished speaking is equally unusable. Cartesia is trying to make the developer decision less about one model and more about a two-model loop with a common API surface.
Cartesia's pricing page shows how the company is packaging that loop. The free plan includes 20,000 credits per month and roughly 27 minutes of Sonic-3.5 text-to-speech, while paid plans scale from Pro to Startup and Scale. For Ink-2, Cartesia lists included monthly usage from roughly 1 hour 51 minutes on Free to roughly 740 hours 44 minutes on Scale. The company also sells Line, its voice-agent layer, at $0.06 per minute of call duration, with an additional $0.014 per minute when using a Cartesia-provided phone number.
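The Line rates lend themselves to a quick back-of-envelope estimate. The Python sketch below uses only the two per-minute figures quoted above. The function name is hypothetical, and the calculation assumes straight per-minute billing with no rounding rules, minimums, or plan discounts, none of which are covered here.

```python
from decimal import Decimal

# Per-minute rates quoted from Cartesia's pricing page.
LINE_RATE_PER_MINUTE = Decimal("0.06")
PHONE_NUMBER_RATE_PER_MINUTE = Decimal("0.014")


def estimate_line_cost(call_minutes: int, cartesia_number: bool = False) -> Decimal:
    """Estimate Line cost in dollars for a given number of call minutes.

    Assumption: billing is linear in minutes, with the phone-number
    surcharge applied to every minute when a Cartesia number is used.
    """
    if call_minutes < 0:
        raise ValueError("call_minutes must be non-negative")
    rate = LINE_RATE_PER_MINUTE
    if cartesia_number:
        rate += PHONE_NUMBER_RATE_PER_MINUTE
    return rate * call_minutes
```

On those assumptions, 1,000 minutes of calls would come to $60, or $74 with a Cartesia-provided phone number.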
That packaging shows the commercial play underneath the benchmark post. Cartesia does not want to be evaluated only as a voice generator against ElevenLabs, Inworld, Google, or the next high-Elo TTS model. Goel is positioning Cartesia as voice infrastructure: Sonic speaks, Ink listens, Line orchestrates, and the SSM story explains why Cartesia believes it can keep latency low as interactions get longer and more multimodal.
Cartesia has been financed for that kind of platform bet. In December 2024, Goel announced a $27 million seed round led by Index Ventures, with participation from Lightspeed, Factory, Conviction, General Catalyst, A*, SV Angel, and angel investors. In March 2025, Cartesia announced a $64 million Series A led by Kleiner Perkins. The company said at the time that the money would go toward research, infrastructure, and models for voice, and that it had powered millions of calls and helped tens of thousands of creators.
The open question is whether benchmark leadership converts into durable distribution. Leaderboards move. Buyers care about total call success, not just Elo, WER, or first-audio latency. But Goel's Monday post shows Cartesia's intended lane clearly: not a voice clone toy, not a narration tool, and not merely a speech API. Cartesia is trying to own the real-time voice-agent substrate before the category settles on its default stack.