Speech-to-Text & Text-to-Speech
What vendors mean: voice in, voice out. What it actually means: two technologies that quietly got ten times better in two years and changed what's possible in customer-facing AI.
The Technical Definition
Speech-to-Text (STT, also called automatic speech recognition or ASR) converts audio into written text. Text-to-Speech (TTS) does the reverse: written text is rendered as audio in a synthetic voice. Both fields were dominated for two decades by a handful of providers (Nuance, Google, Microsoft) producing acceptable-but-robotic results. The 2022-2026 wave reset the category. OpenAI’s Whisper, Deepgram, AssemblyAI, and Speechmatics moved STT to near-human accuracy on clean audio in dozens of languages. ElevenLabs, OpenAI’s TTS models, and Cartesia produced synthetic voices that pass the casual listening test.
What This Actually Means for Your Business
The combination of high-quality STT and TTS is the reason voice AI agents are actually shipping in 2026 instead of being a perpetual demo. Five years ago, every voice deployment had to apologize for the technology. The synthetic voice sounded synthetic. The transcription got names and product SKUs wrong. The latency between speech ending and response starting was uncomfortable. None of those constraints fully apply now.
For STT, the practical change is multilingual coverage and accuracy on real-world audio. Whisper-class models handle accents, code-switching, and background noise far better than the previous generation. Real-time STT (sub-300ms latency) is now standard. Specialized vocabulary (medical terminology, legal terms, product names) still needs a custom vocabulary, fine-tuning, or post-correction with an LLM, but the gap is narrower than it was.
For TTS, the change is voice cloning and emotional control. A 30-second sample is enough to clone a voice well enough that most humans can’t reliably tell the difference in a casual conversation. Voice can be steered for tone, pacing, and emotion. This is good news for accessibility, customer service, and audio content production. It is also the technology underneath a fast-growing class of fraud (CFO impersonation calls, voice-cloned authorization scams) that already cost real companies real money in 2024-2026.
The cost has collapsed. STT that used to cost $0.05 per minute now runs $0.005-0.015 per minute from commodity providers, and the marginal cost approaches zero in self-hosted Whisper deployments. TTS that used to require enterprise contracts is now metered per character at low single-digit dollars per million characters. This pricing is the reason voice AI agents can pay for themselves on a per-call basis in customer service applications that were cost-prohibitive two years ago.
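To make the per-call arithmetic concrete, here is a minimal Python sketch. The STT rates are the ones quoted above. The 4-minute call length, the 2,000 characters of synthesized speech, and the $3 per million characters TTS rate are illustrative assumptions, not measured figures, and the function name is hypothetical.

```python
def voice_cost_per_call(call_minutes, stt_rate_per_minute,
                        tts_characters, tts_rate_per_million_chars):
    """Return the combined STT + TTS cost in dollars for one call."""
    stt_cost = call_minutes * stt_rate_per_minute
    tts_cost = tts_characters / 1_000_000 * tts_rate_per_million_chars
    return stt_cost + tts_cost


# Assumed workload: a 4-minute call in which the agent speaks 2,000 characters.
# Assumed TTS price: $3 per million characters (illustrative only).
old_stt_only = voice_cost_per_call(4, 0.05, 0, 0.0)    # older STT rate, no TTS
new_total = voice_cost_per_call(4, 0.015, 2_000, 3.0)  # top of the current STT range

print(f"old STT alone: ${old_stt_only:.3f} per call")
print(f"new STT + TTS: ${new_total:.3f} per call")
```

Under these assumptions the whole call costs about 7 cents, against 20 cents for transcription alone at the older rate. Swap in your own call lengths and contracted rates before drawing conclusions.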
The thing that hasn’t changed: failure modes are still expensive. An STT mistake on a customer’s account number means the agent looks up the wrong account. A TTS engine that misreads an unusual product name turns “RX-2400” into “are-ex-twenty-four-hundred” and confuses the caller. Production voice systems still need test suites, glossaries, and fallback paths. The quality bar is much higher because the technology is much better, not in spite of it.
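One common guard against the product-name problem is a pronunciation lexicon applied to the text before it reaches the TTS engine. The sketch below is provider-agnostic and only rewrites the text. The function name, the lexicon entry, and the spoken form chosen for “RX-2400” are hypothetical. Some providers offer their own pronunciation dictionaries or SSML support, which may be a better fit where available.

```python
import re


def apply_pronunciations(text, lexicon):
    """Replace whole-term matches with their spoken form (case-insensitive)."""
    if not lexicon:
        return text
    # Try longer terms first so a longer product name wins over its prefix.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(term) for term in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    spoken_by_term = {term.lower(): spoken for term, spoken in lexicon.items()}
    return pattern.sub(lambda m: spoken_by_term[m.group(1).lower()], text)


# Hypothetical lexicon entry: the right spoken form depends on your brand.
lexicon = {"RX-2400": "R X two four zero zero"}

print(apply_pronunciations("Your RX-2400 has shipped.", lexicon))
```

The word-boundary checks keep the rule from firing inside a longer code such as “RX-24000”, which is the kind of case a test suite should cover.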
Reality Check
What the vendor says: “Our voice AI is indistinguishable from a human.”
What that means in practice: It’s indistinguishable on a 20-second demo with a script the vendor chose. On a 4-minute call where a customer goes off-script, asks a follow-up the bot wasn’t trained for, or speaks with a strong regional accent, the gap shows up. Test with your actual call types and your actual customer base before you decide.
What Operators Actually Do
Companies running voice in production benchmark on their actual call recordings, not vendor demos. They take 100 calls representing their real customer base — including the calls with the heaviest accents, the most background noise, and the most specialized vocabulary — and measure word error rate per provider. The numbers usually surprise the procurement team.
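Word error rate is simple enough to compute in-house: count the word-level substitutions, deletions, and insertions needed to turn the provider's transcript into the human reference, then divide by the number of reference words. A minimal Python sketch follows. The provider names and transcripts are made up for illustration, and the normalization here is only lowercasing and whitespace splitting. A real benchmark would also need consistent handling of punctuation and numbers.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Pooled WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    reference_words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors += edit_distance(ref, hyp)
        reference_words += len(ref)
    if reference_words == 0:
        raise ValueError("no reference words to score against")
    return errors / reference_words


# Hypothetical human-verified references and provider outputs.
references = [
    "my account number is four two seven one",
    "i need a refill of lisinopril",
]
provider_outputs = {
    "provider_a": ["my account number is for two seven one",
                   "i need a refill of lisinopril"],
    "provider_b": ["my account number is four two seven one",
                   "i need a refill of listen april"],
}

for name, hypotheses in provider_outputs.items():
    wer = word_error_rate(list(zip(references, hypotheses)))
    print(f"{name}: WER = {wer:.1%}")
```

Pooling errors across all calls, instead of averaging per-call rates, keeps short calls from dominating the result.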
They also build a domain glossary. Product names, customer-specific terms, drug names, and account number formats are the words STT gets wrong, and the fix is usually a custom vocabulary list or a post-processing step that runs the transcript through an LLM with context. Doing this work is the difference between 92% accuracy and 98% accuracy on the calls that matter.
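As a lightweight stand-in for the post-processing step, the Python sketch below snaps near-miss words in a transcript to the closest glossary term using fuzzy string matching from the standard library. It is a substitute technique, not the LLM-based correction described above. The function name, the glossary entries, and the 0.8 similarity cutoff are assumptions. It only handles single-word terms separated by whitespace.

```python
import difflib


def correct_with_glossary(transcript, glossary, cutoff=0.8):
    """Replace words that closely resemble a glossary term with that term."""
    canonical = {term.lower(): term for term in glossary}
    corrected = []
    for word in transcript.split():
        # Best fuzzy match at or above the cutoff, if any.
        matches = difflib.get_close_matches(
            word.lower(), list(canonical), n=1, cutoff=cutoff
        )
        corrected.append(canonical[matches[0]] if matches else word)
    return " ".join(corrected)


# Hypothetical glossary of drug names.
glossary = ["Lisinopril", "Metformin"]

print(correct_with_glossary("patient takes lisinipril daily", glossary))
```

A cutoff that is too low will rewrite ordinary words into glossary terms, so it is worth tuning against the same call recordings used for benchmarking.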
For TTS, the operators who get this right pay attention to brand voice. The default voices from the major providers all sound like the same kind of pleasant-but-generic American radio voice. A bank, a healthcare provider, and a retail brand should not sound identical. Voice selection and pronunciation tuning are part of the brand work, not just a procurement decision.
The Questions to Ask
- What’s the word error rate on call recordings that look like ours? Word error rate (WER) is the standard metric. Insist on a real test on real audio. Vendor numbers on broadcast English are not the right benchmark.
- How are domain-specific terms handled? Product names, drug names, customer-specific vocabulary. There needs to be a documented mechanism for adding terms: either custom vocabulary, fine-tuning, or LLM post-correction.
- What’s the latency from end-of-speech to first audio out? For real-time voice agents, total latency above ~700ms feels broken. Below 300ms feels natural. Get the actual measured number on your network, not the vendor’s data center.
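For the latency question, a small Python sketch of how measured samples might be summarized against the two thresholds above. The sample values are invented. The nearest-rank percentile method and the “noticeable” label for the band between 300ms and 700ms are assumptions of this sketch, not industry definitions.

```python
import math

NATURAL_BELOW_MS = 300  # below this, the text says a response feels natural
BROKEN_ABOVE_MS = 700   # above roughly this, the text says it feels broken


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rate_latency(latency_ms):
    """Classify one end-of-speech to first-audio latency."""
    if latency_ms < NATURAL_BELOW_MS:
        return "natural"
    if latency_ms > BROKEN_ABOVE_MS:
        return "broken"
    return "noticeable"  # assumed label for the in-between band


# Hypothetical measurements taken on your own network, in milliseconds.
samples = [220, 250, 280, 310, 900]

for pct in (50, 95):
    value = percentile(samples, pct)
    print(f"p{pct}: {value}ms ({rate_latency(value)})")
```

Looking at a high percentile as well as the median matters because callers remember the slow turns, not the typical ones.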