Sales call transcripts that survive bad headset audio
Sales call transcription on typical 8 kHz phone audio with a cheap headset lands around 82% accuracy — roughly 17–18% word error rate, the call-centre tier. That sounds rough, but it's lopsided in a useful way: proper nouns, prices, company names, and product references survive better than filler words and back-channels. For CRM logging and call coaching, that's enough. For a regulatory deposition or a contract dispute, it isn't.
Why sales calls land at ~82%, not ~92%
Two things drag the number down, and both are baked into how phone calls work. First, the audio bandwidth: standard telephony samples at 8 kHz, which by the Nyquist limit discards everything above 4 kHz. Much of the consonant detail that separates "fifteen" from "fifty", or "Pat" from "Matt", sits in that discarded band, so the model guesses more than it does on a podcast.
Second, the headset. Sales floors run on $40 USB headsets with foam mics inches from a mouth that's also breathing, sipping coffee, and occasionally fumbling the mute button. One party is usually on the move, on speakerphone, or driving. The other end might be a customer on a tin-can mobile in a coffee shop. You're stacking compression artefacts on top of a low sample rate.
Our audio-to-text engine hits ~92% on a 128 kbps podcast and around 82% on 8 kHz phone audio. The phone figure comes from the AssemblyAI benchmark (17.7% WER on call-centre material), and it matches what we see day to day on real sales calls.
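The two numbers in this section are simple arithmetic, sketched below. The function names are illustrative, and treating accuracy as 1 minus WER is the usual shorthand rather than a strict definition (WER can exceed 100% when a transcript contains many inserted words).

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


def accuracy_from_wer(wer: float) -> float:
    """Shorthand accuracy figure: 1 minus the word error rate."""
    return 1.0 - wer


print(nyquist_hz(8_000))                          # 4000.0 (telephony)
print(nyquist_hz(44_100))                         # 22050.0 (typical podcast master)
print(round(accuracy_from_wer(0.177) * 100, 1))   # 82.3
```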
What an 82% transcript actually looks like
Word error rate (WER) counts every substituted, missing, or inserted word as one error and divides the total by the number of words actually spoken. But not all errors hurt equally. Here's the pattern on sales calls specifically:
Survives well:
- Proper nouns spoken clearly once and repeated (company names, contact names, products)
- Prices and round numbers ("nineteen dollars", "two thousand seats")
- Email addresses and URLs when spelled out
- Industry vocabulary the prospect uses more than once
Gets mangled:
- Filler words and back-channels ("um", "yeah", "right, right, right")
- Short common words at clause boundaries ("a"/"the", "and"/"in", "we"/"he")
- Single-digit numbers said quickly ("five" vs "nine")
- Hesitant proper nouns ("It's [garble] — sorry, [garble] Industries")
If your goal is to capture what was agreed and what was promised, the transcript holds up. If your goal is an exact verbatim record of every word, it doesn't.
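To make the WER definition concrete, here is a minimal sketch that scores a transcript against a reference using word-level edit distance. It assumes simple lowercase, whitespace-split tokens; real scoring tools normalise punctuation and numbers more carefully, and the example sentences are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


spoken = "we can do fifteen seats at nineteen dollars"
heard = "he can do fifty seats at nineteen dollars"
print(word_error_rate(spoken, heard))  # 0.25
```

Both mistakes in this example count the same toward WER, but only one of them ("fifty" for "fifteen") would change what ends up in the CRM, which is why a single accuracy number never tells the whole story.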
What's good enough for CRM logging
For a Salesforce or HubSpot note after a discovery call, you don't need a verbatim record. You need:
- Who was on the call (handled by speaker diarization — see below).
- What product or use case the prospect mentioned.
- Any prices, contract terms, or numbers either side floated.
- The next step and who owns it.
All four can be extracted reliably from an 82% transcript, because they're usually said more than once and they're the kind of content humans repeat for emphasis. The AI summary we run on every job pulls these into a short paragraph plus action items. The DeepSeek model sees the noisy transcript, but the redundancy in sales conversations means the summary lands cleaner than the raw text.
Stereo recordings help a lot here. If your dialer records the two sides on separate channels — most do — we run channel-split diarization instead of model-based diarization: left channel is speaker_0, right is speaker_1, 100% confidence, no model inference. Rename them to "Rep" and "Prospect" once in the bubble transcript view and the CRM-ready output writes itself.
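The idea behind channel-split diarization can be sketched with Python's standard `wave` module. This is an illustration of the technique, not a description of our pipeline: it assumes an uncompressed PCM WAV file, and which channel holds the rep depends on your dialer.

```python
import wave


def split_stereo_wav(src_path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a stereo PCM WAV file to its own mono file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a 2-channel (stereo) WAV file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Stereo PCM is interleaved: [left sample][right sample][left]...
    frame_size = 2 * width
    left, right = bytearray(), bytearray()
    for offset in range(0, len(frames), frame_size):
        left += frames[offset:offset + width]
        right += frames[offset + width:offset + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

Because each output file contains exactly one party, every word transcribed from it can be attributed to that speaker without any voice modelling.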
When NOT to use auto-transcription on sales calls
Be honest about the line. An 82%-accurate transcript is not a legal record.
| Use case | 82% transcript fit |
|---|---|
| CRM note / call summary | Yes — summary works, key facts survive |
| 1:1 call coaching with the rep | Yes — both of you can hear the call, transcript is a scrubbing aid |
| Pipeline review across 20 calls | Yes — search across transcripts beats relistening |
| Quoting the prospect in a follow-up email | Only if you re-listen to verify the quote |
| Compliance evidence (MiFID, FINRA) | No — use your dialer's compliance recording, not an AI transcript |
| Contract dispute / "what did they promise" | No — re-listen, or pay for human transcription |
| Regulatory deposition | No — use a certified human transcriber |
The rule of thumb: if the transcript is an aid to a human who can re-check the audio, 82% is fine. If the transcript is going to be cited instead of the audio in a high-stakes context, it isn't.
Practical workflow for a sales team
A workable pattern that doesn't require new tools:
- Record on your existing dialer (Gong, Aircall, Dialpad, Zoom Phone — anything that gives you an MP3 or WAV).
- Drop the file into the dropzone or POST it to the REST API from your CRM integration. Stereo recordings get channel-split diarization automatically.
- Get back a transcript with speaker labels, an AI summary, and action items, usually in under 10 minutes for a 30-minute call. Processing runs at approximately 6× realtime, so a 60-minute file completes in 9–11 minutes.
- Paste the summary into your CRM. Keep the full transcript searchable for when you need to find "that prospect who mentioned the SOC 2 audit".
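For planning purposes, the 6× figure in the steps above translates into turnaround roughly as follows. This is a back-of-envelope sketch with an illustrative function name; actual times vary with queue load and file format.

```python
def estimated_processing_minutes(audio_minutes: float, speedup: float = 6.0) -> float:
    """Rough processing time for a recording at a given realtime multiple."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_minutes / speedup


for call_length in (30, 60):
    estimate = estimated_processing_minutes(call_length)
    print(f"{call_length}-minute call: about {estimate:.0f} minutes")
# 30-minute call: about 5 minutes
# 60-minute call: about 10 minutes
```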
Source audio is permanently deleted from our infrastructure within 24 hours of job completion. Transcripts stay in your account until you delete them. We do not train models on your data — relevant if your sales calls discuss confidential pricing or unannounced product plans.
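For a rep doing 8 calls a day at 30 minutes each, that's 4 hours of audio (240 minutes) per day, so transcribing every call would use up the Pro plan's 600 minutes per month in two and a half days. The plan fits comfortably only if you transcribe the meaningful calls (demos, discoveries, closes) rather than every dial: 600 minutes covers about 20 thirty-minute calls a month.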
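The quota arithmetic is easy to rerun for your own call volume. A minimal sketch using the figures above; the function and key names are illustrative.

```python
def quota_coverage(calls_per_day: int, minutes_per_call: int, quota_minutes: int) -> dict:
    """How far a monthly minute quota stretches at a given call volume."""
    daily_minutes = calls_per_day * minutes_per_call
    return {
        "daily_minutes": daily_minutes,
        # Days of transcribing every call before the quota runs out.
        "days_covered": quota_minutes / daily_minutes,
        # Whole calls of this length that fit inside the quota.
        "calls_covered": quota_minutes // minutes_per_call,
    }


print(quota_coverage(calls_per_day=8, minutes_per_call=30, quota_minutes=600))
# {'daily_minutes': 240, 'days_covered': 2.5, 'calls_covered': 20}
```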
FAQ
Why is sales call transcription less accurate than podcast transcription?
Phone audio is sampled at 8 kHz, cutting off frequencies above 4 kHz where consonants live, so the model has less detail to work with. Add a cheap headset, speakerphone, or a prospect on mobile, and you stack compression on top. Real-world podcasts at 128 kbps land near 92% accuracy; 8 kHz phone audio lands near 82%. That's not a flaw in the model — it's the bandwidth limit of telephony.
Can I transcribe Gong, Aircall, or Dialpad recordings?
Yes — any dialer that exports an MP3, WAV, M4A, or MP4 works. Download the recording from your dialer and upload it, or use the REST API to push files automatically. If the dialer records the rep and the prospect on separate stereo channels (most do), we use channel-split diarization, which is 100% accurate at separating who said what.
Does speaker diarization work on phone calls?
Yes, and it works better than on most other audio types if your dialer records in stereo. Stereo phone recordings get split by channel — left becomes speaker_0, right becomes speaker_1 — with no model inference needed. Mono recordings use pyannote-based diarization, which still works but is less reliable when two voices overlap or when one party speaks much more than the other.
Is the transcript good enough to use as legal evidence?
No. At ~82% accuracy on phone audio, an automated transcript is an aid for humans who can re-check the recording, not a substitute for the recording itself. For compliance, depositions, or contract disputes, use the original audio (preserved by your dialer's compliance recording) or pay for certified human transcription. The auto-transcript is for working notes, coaching, and CRM logging.
How do I get usable CRM notes from a noisy call?
Run the call through the transcriber and use the AI summary, not the raw transcript. Summaries cope with noisy input because sales conversations are repetitive — prospects restate prices, products, and timelines multiple times, so the important facts survive even when filler words don't. Paste the summary into your CRM and link to the full transcript in case you need to search it later.
What's the cost per sales call?
On the Pro plan at $19/month, the first 600 minutes are included, which is about 20 thirty-minute calls or 10 hour-long demos per month. Overage is $0.04/minute on Pro and $0.02/minute on Business, so on Pro a 30-minute call beyond quota costs $1.20 and a 60-minute call costs $2.40. For higher-volume teams, the Business plan at $49/month includes 2,500 minutes.
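To compare plans at a given monthly volume, a small calculator like the one below can help. It is a sketch built only from the base prices, included minutes, and per-minute overage rates quoted in this answer; taxes and any other billing rules are not modelled.

```python
def monthly_cost(minutes_used: int, base_price: float,
                 included_minutes: int, overage_per_minute: float) -> float:
    """Base subscription plus per-minute overage beyond the included quota."""
    overage_minutes = max(0, minutes_used - included_minutes)
    return round(base_price + overage_minutes * overage_per_minute, 2)


PRO = {"base_price": 19.0, "included_minutes": 600, "overage_per_minute": 0.04}
BUSINESS = {"base_price": 49.0, "included_minutes": 2500, "overage_per_minute": 0.02}

for minutes in (600, 1000, 1350, 2000):
    print(minutes, monthly_cost(minutes, **PRO), monthly_cost(minutes, **BUSINESS))
# 600 19.0 49.0
# 1000 35.0 49.0
# 1350 49.0 49.0
# 2000 75.0 49.0
```

At these rates the two plans cost the same at 1,350 minutes a month; beyond that, Business is the cheaper option.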
Will this work if I record both sides on the same channel?
Yes, but accuracy on speaker labels drops. Mono recordings use pyannote diarization, which infers who's speaking from voice characteristics. It works well when voices are distinct and turns are clean, but stumbles on overlap and on calls where one party dominates. If you have the choice in your dialer settings, record in stereo — the diarization quality jump is significant and the file size cost is minimal.
Related reading
- How accurate is automatic transcription? — the per-bitrate and per-source-type breakdown
- Interview transcription with speaker diarization — the same diarization tooling, applied to journalism workflows
- REST API reference — pushing dialer recordings to the transcriber from your CRM
- Pricing — Pro vs Business for sales-team volume