Transcription accuracy explained: what 95% really means
Transcription accuracy is almost always expressed as 100% − WER, where WER (word error rate) counts every substitution, insertion, and deletion against a human reference transcript. So a vendor quoting "95% accuracy" is claiming 5% WER — usually measured on clean studio benchmarks like LibriSpeech, not on the audio you actually have. On real-world podcast audio at 128 kbps or higher, the current plateau is closer to ~92% (7.88% WER, per AssemblyAI's published Whisper Large-v3 benchmark). Meeting audio sits around 88%. Telephony around 82%.
What "word error rate" actually counts
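Word error rate (WER) is the number of word-level edits needed to turn the machine transcript into the human reference, divided by the length of the reference. Three edit types are counted: substitutions (wrong word), insertions (extra word), and deletions (missing word). It's computed as (S + I + D) / N, where N is the total words in the reference. Because insertions are counted against a fixed N, WER can exceed 100% on a transcript with many extra words.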
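To make the formula concrete, here is a minimal sketch of a WER calculation using a standard word-level edit-distance alignment. The names `WerResult` and `word_error_rate` are illustrative, not taken from any vendor's tooling, and the sketch assumes both transcripts are already split-ready plain text with no normalisation applied.

```python
from dataclasses import dataclass


@dataclass
class WerResult:
    substitutions: int
    insertions: int
    deletions: int
    reference_words: int

    @property
    def wer(self) -> float:
        # (S + I + D) / N, with N taken from the reference transcript.
        errors = self.substitutions + self.insertions + self.deletions
        return errors / self.reference_words


def word_error_rate(reference: str, hypothesis: str) -> WerResult:
    """Align hypothesis to reference word by word and count S, I, D."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell holds (edits, S, I, D) for aligning ref[:i] with hyp[:j].
    # Row 0: an empty reference prefix means every hypothesis word is an insertion.
    prev = [(j, 0, j, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Column 0: an empty hypothesis prefix means every reference word is deleted.
        curr = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                curr.append(prev[j - 1])  # words match, no cost
            else:
                e, s, ins, d = prev[j - 1]
                substitute = (e + 1, s + 1, ins, d)
                e, s, ins, d = curr[j - 1]
                insert = (e + 1, s, ins + 1, d)
                e, s, ins, d = prev[j]
                delete = (e + 1, s, ins, d + 1)
                # Tuples compare on total edits first, so this keeps the cheapest path.
                curr.append(min(substitute, insert, delete))
        prev = curr

    _, s, ins, d = prev[-1]
    return WerResult(s, ins, d, len(ref))


result = word_error_rate(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over lazy dog today",
)
# 1 substitution, 1 insertion, 1 deletion over 9 reference words.
print(result, f"WER = {result.wer:.1%}")
```

When several alignments tie on total edits, the split between S, I, and D can differ between tools, but the WER itself does not.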
That means a transcript can have a low WER and still feel wrong. Punctuation isn't counted. Capitalisation isn't counted. Speaker labels aren't counted. It also means harmless differences cost you: "gonna" in place of "going to" is two errors (a substitution plus a deletion). And every word weighs the same, so misspelling a proper noun once costs you exactly as much as misspelling "the" once.
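One way to see what is and isn't scored is to look at the text after normalisation. The sketch below is an assumption about a typical pre-scoring step (lowercase, strip ASCII punctuation, split on whitespace); real scoring pipelines use more elaborate normalisers, and `normalize` is a hypothetical helper, not part of any specific benchmark.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split into words.

    Note: this also removes apostrophes ("I'm" becomes "im"). That is
    harmless as long as reference and hypothesis are treated the same way.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


reference = "Well, I'm going to call Dr. Nguyen."
hypothesis = "well im gonna call dr nguyen"

print(normalize(reference))   # ['well', 'im', 'going', 'to', 'call', 'dr', 'nguyen']
print(normalize(hypothesis))  # ['well', 'im', 'gonna', 'call', 'dr', 'nguyen']
```

After normalisation the missing punctuation and capitals no longer show up as differences, while "gonna" against "going to" still does.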
Accuracy is just the inverse: 100% − WER. There's no separate metric. When a marketing page says "99% accurate", it means 1% WER. When it says "99.8%", it means 0.2% WER, which is lower than the rate at which two professional human transcribers disagree with each other, and is almost certainly only true on a single curated test set.
Why marketing pages quote wildly different numbers
The WER you measure depends entirely on the audio you measure it on. The same model can score 2% WER on LibriSpeech (audiobooks recorded in studios) and 18% WER on call-centre audio. Both numbers are real. Only one matches what you'll hear.
Three benchmark categories are commonly cited:
- LibriSpeech test-clean — audiobook recordings, single speaker, professional mic, no overlap. Most models hit 2-3% WER here. This is where "97%+" claims come from.
- TED-LIUM — TED talks. Single trained speaker, decent audio, scripted-ish content. 4-6% WER for modern models.
- Earnings calls / meetings / podcasts — multi-speaker, real-world acoustics, conversational speech with disfluencies. 8-12% WER, which is where the real numbers live.
A vendor that quotes one accuracy number and doesn't say what benchmark it's against is hiding the conditions. The honest version is a table per use case, which is why our accuracy page breaks it down by audio type.
Where the 92% comes from
The 92% figure we quote for podcast transcription comes from AssemblyAI's published benchmark of Whisper Large-v3 on real-world podcast audio: 7.88% WER. That's the plateau for the current generation of models on this audio class.
The plateau matters. For a long stretch (roughly 2016 to 2022), each new model generation knocked another point or two off WER on real-world audio. Since Whisper Large-v3 and the equivalent commercial models (AssemblyAI Universal-3, Deepgram Nova-2), the curve has flattened. New releases mostly improve diarization, latency, and language coverage — not raw WER on English podcasts.
On Transcription.Solutions, the primary ASR is AssemblyAI Universal-3, with Whisper Large-v3 as a transient-error fallback. Both sit on the same plateau. We don't quote 95% because we can't reliably hit it on customer audio.
Accuracy by audio type — the honest table
This is what the numbers actually look like across the audio we see most often. WER and accuracy are two ways of saying the same thing; both are listed so you can match whichever number a vendor put in front of you.
| Audio type | Typical WER | Accuracy |
|---|---|---|
| Studio / audiobook (LibriSpeech) | ~3% | ~97% |
| Podcast at 128 kbps+, two hosts | ~8% | ~92% |
| Zoom / Meet recording, 3-5 speakers | ~11% | ~88% |
| Conference call, 8 kHz telephony | ~18% | ~82% |
The drop from 92% to 88% on meetings isn't the model getting worse. It's the audio: multiple speakers, cross-talk, laptop mics, room reverb. The drop from 88% to 82% on telephony is the sample rate. Audio sampled at 8 kHz contains nothing above 4 kHz, which removes the high-frequency content that distinguishes similar consonants (f/s/th/sh), and no model recovers what isn't in the signal.
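The telephony ceiling follows from the sampling theorem: a signal sampled at a given rate can only represent frequencies up to half that rate. A small sketch of the arithmetic, with `nyquist_limit_hz` as an illustrative helper name:

```python
def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Highest frequency a sampled signal can represent: half the sample rate."""
    return sample_rate_hz / 2


for rate in (8_000, 16_000, 44_100):
    print(f"{rate} Hz sampling keeps content up to {nyquist_limit_hz(rate):.0f} Hz")
```

For an 8 kHz call recording that limit is 4 kHz, so anything a consonant carries above that point is simply absent from the file.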
Re-recording lifts you more than re-modelling does
Once you're on the plateau, the biggest accuracy gains come from the audio side, not the model side. A few before-the-recording changes that move the WER more than switching vendors will:
- Use a real microphone, even a cheap one. A $60 USB condenser at 30 cm beats AirPods at 1 m by 2-4 WER points on conversational speech.
- Record each speaker on their own track. Stereo files with one speaker per channel get channel-split diarization at 100% confidence — no model inference, no confused turns.
- Encode at 128 kbps or higher. Below 96 kbps, WER climbs about 5 points. Below 48 kbps, expect a manual review pass.
- Avoid 8 kHz telephony if you can. Recording the local side of a call at full bandwidth — even just from the laptop — beats the down-sampled call recording on both ends.
The model isn't going to get 4 points better next year. Your microphone choice can move it 4 points this week.
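If you already have a stereo recording with one speaker per channel, splitting it into two mono files is straightforward. The following is a rough sketch using Python's standard `wave` module; it assumes an uncompressed PCM WAV file, reads the whole file into memory, and the function names and paths are hypothetical. It is not a description of how any particular transcription service performs channel splitting.

```python
import wave


def split_stereo_frames(frames: bytes, sample_width: int) -> tuple[bytes, bytes]:
    """Split interleaved stereo PCM (L R L R ...) into (left, right) mono bytes."""
    frame_size = 2 * sample_width  # one sample per channel in each frame
    if len(frames) % frame_size:
        raise ValueError("frame data is not a whole number of stereo frames")
    left = bytearray()
    right = bytearray()
    for offset in range(0, len(frames), frame_size):
        left += frames[offset:offset + sample_width]
        right += frames[offset + sample_width:offset + frame_size]
    return bytes(left), bytes(right)


def split_stereo_wav(src_path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a 2-channel WAV file to its own mono WAV file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel WAV file")
        sample_width = src.getsampwidth()
        frame_rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    left, right = split_stereo_frames(frames, sample_width)
    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(sample_width)
            out.setframerate(frame_rate)
            out.writeframes(data)
```

A call such as `split_stereo_wav("interview.wav", "host.wav", "guest.wav")` would then give you one file per speaker, each of which can be transcribed and labelled on its own.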
What this means if you're picking a tool
If a vendor quotes a single accuracy number with no audio-type breakdown, mentally subtract 4-6 points before you trust it. The ~95-99% figures on most pricing pages are LibriSpeech numbers. On your actual audio — interviews, podcasts, meetings — expect 88-92% from any current top-tier model, regardless of branding.
The differentiator isn't WER anymore. It's diarization quality, summary quality, export formats, API ergonomics, and how the tool handles your specific edge case. For a journalist running interviews, speaker labelling matters more than the third decimal point of WER. For a podcaster doing show notes, the AI summary matters more than re-running the transcript through a second model.
Try your own audio on the free tier before you decide. Thirty minutes is enough to see the WER you'll actually live with.
FAQ
What is a good word error rate for transcription?
For real-world audio, anything under 10% WER (90% accuracy) is good. Under 8% (92%+) is at the current state of the art for podcasts and clean conversational audio. Under 5% is achievable only on studio-grade single-speaker recordings — audiobooks, scripted narration, voice-over. If a tool quotes sub-3% WER as its headline number, ask which benchmark; it's almost certainly LibriSpeech test-clean.
Why do two transcription tools give different accuracy on the same file?
Different models tokenise differently, handle disfluencies differently, and decide differently between "um" and silence. On a 60-minute file you'll see roughly 1-3 WER points of variation between top-tier models even when both are on the plateau. Most of that variance is in proper nouns, technical jargon, and informal speech ("kinda", "wanna"). Neither output is "wrong" — they're calibrated differently.
Does higher audio quality always mean higher accuracy?
Up to a point. Going from 32 kbps to 128 kbps cuts WER significantly (about 12 points in our measurements). Going from 128 kbps to 320 kbps does almost nothing — the model has all the information it needs at 128 kbps. The accuracy ceiling at that point is set by the model, not the file. Microphone choice and recording environment matter more than bitrate above 128 kbps.
How is transcription accuracy measured?
Accuracy is measured by aligning the machine transcript against a human reference transcript and counting word-level differences. The formula is WER = (substitutions + insertions + deletions) / total reference words, and accuracy is 100% − WER. Punctuation, capitalisation, and speaker labels are not counted in standard WER. Benchmarks are published against fixed test sets — LibriSpeech, TED-LIUM, Common Voice — so different vendors' numbers are comparable only if they use the same set.
Why is telephony audio so much harder to transcribe?
Phone audio is sampled at 8 kHz, which means frequencies above 4 kHz are gone. Many consonants (f, s, th, sh) are distinguished primarily by their high-frequency content. When that content is missing, even a perfect model can't recover it. Expect ~82% accuracy on 8 kHz call recordings versus ~92% on the same conversation captured at a 16-44 kHz sample rate.
Can AI transcription reach 100% accuracy?
No, and neither can humans. Inter-annotator agreement — two professional transcribers given the same audio — sits around 96-98% on real-world conversational audio. People disagree on disfluencies, overlapping speech, and "um" vs "uh". Any tool claiming 99.8%+ on real audio is either measuring on studio benchmarks or comparing against a reference transcript that was itself produced by the same kind of model.
Related reading
- Speaker diarization explained: channel split vs pyannote — why stereo recording beats any diarization model.
- Transcription API in one screen: POST a file, GET a transcript — how to wire transcription into your own product.
- Sales call transcription on bad headset audio: what works — the practical playbook for ~82% telephony accuracy.