100% scorecarding is finally affordable — here's the math
If your QA team scores 2% of calls by hand, you're flying blind on the other 98%. AI transcription plus rubric-matching changes the unit economics: instead of paying a QA analyst 8 minutes per call to listen, pause, and score, you pay roughly $0.01–0.02 per call to transcribe it and let a model flag the rubric lines a human should look at.
That's the whole pitch. The rest of this article is about where it works, where it breaks, and what to wire up on Monday.
We run AssemblyAI Universal-3 in production for customer support call QA workloads. The accuracy numbers below are ours, qualified by audio condition. The integration notes are from real Zendesk Talk, Talkdesk, and Five9 deployments.
Why 2% sampling misses what actually moves CSAT
A contact center with 50 reps taking 40 calls a day produces 2,000 recordings daily. A two-person QA team listening to 20 calls each pulls a 2% sample. That sample answers exactly one question well: is my median agent following the script?
It answers almost nothing else. The patterns that move CSAT live in the long tail: repeat-contact drivers, the one phrase a new agent uses that tanks resolution, the rep who is sharp at 9 a.m. and degrading by 4 p.m., the product bug surfacing in 40 calls a week. If a problem shows up in 3% of calls, your 40-call daily sample catches it about once (3% of 40 is 1.2 calls). You won't notice it, and you won't be able to tell your product team which SKU it's about.
The same 2,000 daily calls transcribed in full give you 60 hits on that 3% problem, searchable by keyword, agent, queue, and time-of-day. The QA scorecard stops being a coaching artifact and starts being a feedback channel into product and CX ops.
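The sampling arithmetic can be checked in a few lines. This is a minimal sketch using the figures from this section (50 reps, 40 calls each, two reviewers at 20 calls each, a 3% problem); the function names are illustrative, and the binomial model assumes the problem is spread evenly across calls.

```python
from math import comb

def expected_hits(calls: int, prevalence: float) -> float:
    """Expected number of calls in a set that exhibit the problem."""
    return calls * prevalence

def prob_at_least(calls: int, prevalence: float, k: int) -> float:
    """Binomial probability that at least k of `calls` calls show the problem."""
    p_fewer = sum(
        comb(calls, i) * prevalence**i * (1 - prevalence) ** (calls - i)
        for i in range(k)
    )
    return 1 - p_fewer

daily_calls = 50 * 40   # 2,000 recordings per day
sample_size = 2 * 20    # 40 calls scored by hand (2%)
prevalence = 0.03       # problem appears in 3% of calls

sampled = expected_hits(sample_size, prevalence)   # about 1.2 calls
full = expected_hits(daily_calls, prevalence)      # about 60 calls
# Chance that a day's manual sample contains no instance at all (roughly 0.30).
p_miss = 1 - prob_at_least(sample_size, prevalence, 1)
```

Under these assumptions the manual sample misses the problem entirely on roughly three days in ten, while full transcription surfaces about 60 instances every day.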
This is the actual unlock — not "score more calls" but "see the 98% you couldn't see before."
Scorecard automation: model flags, humans review
The workflow we see working in production looks like this:
- Call ends, recording lands in your contact center platform.
- Webhook fires to a transcription job — typically a 4–12 minute audio file.
- Transcript comes back with speaker labels and timestamps.
- A rubric pass (LLM or rules) marks each scorecard line: pass / review / unknown.
- A human QA reviewer opens only the flagged calls, jumps to the flagged timestamp, confirms or overrides.
The human is still the scorer of record. The model is doing triage. On a typical 20-line scorecard, maybe 3–5 lines per call get flagged. Reviewer time per call drops from 8 minutes to under 2.
The three-outcome framing matters. "Unknown" is not the same as "pass" — it means audio or transcript quality wasn't sufficient to judge. If one queue has 20% unknowns and another has 3%, that's a recording-quality problem, not an agent-performance problem. Route unknowns to a separate bucket and you'll catch infrastructure issues your old QA process couldn't see.
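One way to surface the recording-quality signal is to aggregate outcomes per queue. The sketch below assumes each scored rubric line is available as a (queue, outcome) pair; the 10% alert threshold and the function names are illustrative choices, not values from our deployments.

```python
from collections import Counter, defaultdict

def unknown_rate_by_queue(line_results):
    """Return {queue: share of scorecard lines whose outcome is 'unknown'}."""
    counts = defaultdict(Counter)
    for queue, outcome in line_results:
        counts[queue][outcome] += 1
    return {queue: c["unknown"] / sum(c.values()) for queue, c in counts.items()}

def queues_to_inspect(rates, threshold=0.10):
    """Queues whose unknown rate suggests an audio problem (threshold is an assumption)."""
    return sorted(queue for queue, rate in rates.items() if rate >= threshold)

# Toy data mirroring the 20% vs 3% example above.
results = (
    [("billing", "unknown")] * 20 + [("billing", "pass")] * 70 + [("billing", "review")] * 10
    + [("returns", "unknown")] * 3 + [("returns", "pass")] * 90 + [("returns", "review")] * 7
)
rates = unknown_rate_by_queue(results)
suspect = queues_to_inspect(rates)
```

Here only the "billing" queue crosses the threshold, which points at its recording setup rather than at its agents.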
Don't start with a 40-field scorecard. Pick 6–10 transcript-visible items for the first pilot. Long rubrics produce noisy automation and review fatigue, and they obscure which lines the model can actually handle.
The lines that automate cleanly are deterministic: did the agent say the verification phrase, offer the callback, read the mini-Miranda, confirm the email on file. Add a few high-value semantic flags too — did the customer mention cancellation, chargeback, legal action, or a competitor; did the agent use prohibited refund language; did the customer ask for a supervisor. Those are easy LLM calls against the transcript.
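A deterministic line can be sketched as a phrase check restricted to agent turns. The turn fields used here (speaker, start, text, confidence) and the 0.6 confidence floor are assumptions for illustration, not the exact shape of any transcription response.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so matching ignores formatting."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

def contains_phrase(text, phrase):
    """True if the words of `phrase` appear contiguously in `text`."""
    words, target = normalize(text), normalize(phrase)
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

def score_line(turns, phrase, min_confidence=0.6):
    """Return pass / review / unknown for one scorecard line."""
    agent_turns = [t for t in turns if t["speaker"] == "agent"]
    for turn in agent_turns:
        if contains_phrase(turn["text"], phrase):
            # Keep the timestamp so a reviewer can jump straight to it.
            return {"outcome": "pass", "timestamp": turn["start"]}
    # No agent audio, or any low-confidence agent turn: absence is not provable.
    if not agent_turns or min(t["confidence"] for t in agent_turns) < min_confidence:
        return {"outcome": "unknown", "timestamp": None}
    # Transcript looks reliable and the phrase is missing: send to a human.
    return {"outcome": "review", "timestamp": None}

turns = [
    {"speaker": "customer", "start": 3.0, "confidence": 0.90,
     "text": "Can someone call me back?"},
    {"speaker": "agent", "start": 12.5, "confidence": 0.92,
     "text": "Sure, I can offer you a callback this afternoon."},
]
result = score_line(turns, "offer you a callback")
```

The model never emits a hard fail in this sketch: a missing phrase becomes "review", which keeps the human as scorer of record.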
The lines that don't automate cleanly are the soft ones: tone, empathy, ownership. Keep those human. An LLM scoring "empathy" on a transcript is not the same as a person listening to the call.
We've written more about the underlying audio-to-text pipeline and how the rubric layer sits on top of it.
Speaker diarization: stereo wins, mono is a compromise
This is the single biggest accuracy lever in support QA and most teams get it wrong on the first pass.
Stereo recordings — agent on one channel, customer on the other — give you perfect speaker separation. We channel-split before transcription and label each side from the file metadata. There is no diarization error because there is no diarization — it's two mono transcripts stitched on a shared timeline.
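The stitching step can be sketched as a merge of two per-channel segment lists by start time. The segment shape (dicts with start and text keys) is assumed for illustration, and each list is assumed to be already sorted, as a transcript of a single channel normally is.

```python
import heapq

def stitch_channels(agent_segments, customer_segments):
    """Merge two per-channel transcripts into one timeline ordered by start time.

    Speaker labels come from the channel, not from a diarization model.
    """
    labeled_agent = [dict(seg, speaker="agent") for seg in agent_segments]
    labeled_customer = [dict(seg, speaker="customer") for seg in customer_segments]
    return list(
        heapq.merge(labeled_agent, labeled_customer, key=lambda seg: seg["start"])
    )

agent = [
    {"start": 0.0, "text": "Thank you for calling Acme."},
    {"start": 7.4, "text": "I can help with that."},
]
customer = [
    {"start": 3.1, "text": "Hi, my order never arrived."},
]
timeline = stitch_channels(agent, customer)
speakers = [seg["speaker"] for seg in timeline]
```

Because every label is derived from the channel, a scorecard check on agent speech never has to trust a speaker-assignment model.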
Mono mixdowns — both speakers in one waveform — require a diarization model. We use pyannote-3.1. It's good for 2-speaker calls, which most support calls are. But on a mono recording with a customer, an agent, a supervisor barge-in, and an IVR voice in the first 20 seconds, you'll see speaker swaps and merges. Expect 5–10% of turns to be misattributed on mono support audio.
For QA, misattribution is worse than transcription error. "Did the agent say X" is a scorecard line — if the model thinks the customer said it, the line fails wrongly.
Check your contact center settings before anything else. Zendesk Talk, Talkdesk, and Five9 all support dual-channel recording. It's usually an account-level toggle. Flip it on for new calls, accept that historical mono recordings will have weaker diarization, and move on.
8 kHz phone audio: WER ~17.7% — what it means for your scorecard
Telephony audio is sampled at 8 kHz. Podcast and meeting audio is sampled at 16 kHz or higher. An 8 kHz sample rate captures frequencies only up to 4 kHz, so on phone calls the model has at most half the frequency information to work with.
On clean 16 kHz podcast English, Universal-3 lands around 7.88% WER. On 8 kHz telephony, we measure ~17.7% WER: roughly one word in six is substituted, dropped, or spuriously inserted.
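For readers who want to measure WER on their own audio, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It does only lowercase-and-split normalization; production scoring usually also normalizes punctuation and number formats, which this sketch omits.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of six reference words.
score = wer("thank you for calling acme support",
            "thank you for calling acne support")
```

One wrong word in a six-word greeting gives a WER of 1/6, about 16.7%, which is close to what the telephony figure feels like in practice.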
That sounds catastrophic for QA. It isn't, for two reasons.
First, scorecard items don't depend on every word — they depend on the presence or absence of specific phrases. "Thank you for calling Acme" is recoverable even if "Acme" mistranscribes as "Acne" — a fuzzy match on the rubric handles it. LLMs are also good at inferring meaning from slightly garbled text, so a question like "did the agent verify the PIN" still resolves even if "one two three four" comes through as "one two tree four". Greeting compliance, callback offers, hold etiquette, closing scripts, and disposition statements all survive 17.7% WER fine.
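A fuzzy rubric match can be sketched with the standard library's difflib. The sliding-window approach and the 0.85 similarity threshold below are illustrative assumptions; tune the threshold against transcripts from your own queues.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Lowercase words with punctuation removed."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fuzzy_contains(transcript, phrase, threshold=0.85):
    """Slide a phrase-sized window over the transcript and keep the best match.

    Returns (matched, best_ratio). The threshold is an assumption to tune.
    """
    words, target = tokenize(transcript), tokenize(phrase)
    target_text = " ".join(target)
    best = 0.0
    for i in range(max(len(words) - len(target) + 1, 1)):
        window = " ".join(words[i:i + len(target)])
        best = max(best, SequenceMatcher(None, window, target_text).ratio())
    return best >= threshold, best

found, ratio = fuzzy_contains(
    "Hi, thank you for calling Acne, how can I help?",
    "Thank you for calling Acme",
)
```

The "Acne" mistranscription still matches here, because one wrong character barely moves the similarity ratio on a five-word phrase.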
Second, the errors cluster in predictable places: proper nouns (customer names, product SKUs, addresses), numbers spoken quickly, and the first 1–2 seconds after a hold release where compression artifacts spike. If your rubric depends on numeric accuracy — order numbers read back, dollar amounts confirmed — build a secondary check. Don't trust a single ASR pass on a 10-digit order number on 8 kHz audio.
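One possible secondary check is to compare the digits heard in the agent's readback against the order number already on file, and to route any mismatch to a human instead of failing the line. The word-to-digit table and function names are illustrative; treating "oh" as zero is an assumption that can misfire on interjections.

```python
import re

# Spoken-digit vocabulary (assumption: "oh" is read as zero).
WORD_DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def spoken_digits(text):
    """Collect digits from a transcript, whether spoken as words or numerals."""
    digits = []
    for token in re.findall(r"[a-z]+|[0-9]", text.lower()):
        if token.isdigit():
            digits.append(token)
        elif token in WORD_DIGITS:
            digits.append(WORD_DIGITS[token])
    return "".join(digits)

def check_readback(agent_text, number_on_file):
    """'pass' only on an exact digit match; anything else goes to a reviewer."""
    expected = re.sub(r"[^0-9]", "", number_on_file)
    heard = spoken_digits(agent_text)
    return "pass" if expected and expected in heard else "review"

ok = check_readback(
    "Your order number is four one five, 2 2 oh nine eight seven", "415-220-987"
)
garbled = check_readback("I have one two tree four", "1234")
```

A mistranscribed digit ("tree") drops out of the heard string, so the check falls back to "review" and never silently passes a wrong number.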
Don't automate these on phone audio:
- Exact PII readback verification (names, account numbers, addresses)
- Anything that hinges on a single keyword in a noisy moment
- Sentiment scoring on a customer who's already shouting — acoustic clipping eats words
For multilingual contact centers, test each major language separately on a sample of your real audio. We support 99 languages at one price, but accuracy still varies by accent, code-switching, and how much domain vocabulary appears in a given queue.
Integration: pulling recordings out of your contact center
Every major contact center platform exposes call recordings via API. The pattern is the same across vendors:
- Zendesk Talk: Voice API exposes recording URLs on the ticket. Webhook on ticket update, fetch the recording, send to transcription.
- Talkdesk: Recordings API with signed download URLs. Webhook on call end.
- Five9: Call Recording API or SFTP drop, depending on contract. Webhook via the Reporting API.
- Genesys Cloud: Recording API with async job creation. Webhook on conversation end.
- Amazon Connect: Recordings land in S3 by default. S3 event → Lambda → transcription job.
We don't ship a native connector for any of these — you wire the webhook. We also don't automatically write QA results back into Zendesk, Salesforce, or your WFM tool. The transcription endpoint takes a URL or an uploaded file and returns JSON with text, timestamps, speaker labels, and confidence scores. Standard REST.
A pragmatic architecture:
- Contact center webhook → small queue (SQS, Cloud Tasks, whatever).
- Worker pulls recording, posts to transcription with stereo flag and language.
- Transcript stored in your warehouse (Snowflake, BigQuery, Postgres) keyed by call ID.
- Rubric scorer runs on the transcript, writes results to a qa_scores table.
- QA tool (your existing one, or a Retool/Hex front-end) reads flagged calls.
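The last two steps of the architecture above can be sketched with SQLite standing in for the warehouse. Only the qa_scores table name comes from this article; the column names, the primary key, and the helper functions are hypothetical.

```python
import sqlite3

def init_db(conn):
    """Create a qa_scores table (column layout is an illustrative assumption)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS qa_scores (
               call_id     TEXT NOT NULL,
               rubric_line TEXT NOT NULL,
               outcome     TEXT NOT NULL
                           CHECK (outcome IN ('pass', 'review', 'unknown')),
               timestamp_s REAL,
               PRIMARY KEY (call_id, rubric_line)
           )"""
    )

def write_scores(conn, call_id, scores):
    """Upsert one row per rubric line so re-scoring a call is idempotent."""
    conn.executemany(
        "INSERT OR REPLACE INTO qa_scores VALUES (?, ?, ?, ?)",
        [(call_id, line, outcome, ts) for line, outcome, ts in scores],
    )
    conn.commit()

def flagged_calls(conn):
    """Call IDs with at least one line that needs human review."""
    rows = conn.execute(
        "SELECT DISTINCT call_id FROM qa_scores "
        "WHERE outcome = 'review' ORDER BY call_id"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
init_db(conn)
write_scores(conn, "call-001", [("greeting", "pass", 1.2), ("callback_offer", "review", None)])
write_scores(conn, "call-002", [("greeting", "pass", 0.8)])
to_review = flagged_calls(conn)
```

Keying rows by call ID and rubric line means a retried webhook overwrites the earlier result instead of duplicating it, which matters when the queue redelivers a message.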
Most teams we work with already have the queue and the warehouse (steps 1 and 3). The missing pieces are the transcription job and the rubric pass. For mixed workloads (calls plus internal Zoom QBRs plus training sessions), the same pipeline handles all of it. The meeting-bot side runs through Recall.ai for live meetings; the call-recording side is plain file ingestion.
A note on compliance
Support calls often contain PII and sometimes PHI. A few things to be honest about up front.
We do HIPAA-grade data handling at rest — encryption, access controls, deletion endpoints. We are not a HIPAA BAA-covered product yet. If you're doing healthcare-adjacent support (pharmacy, payer, provider services) and need a signed BAA, we're not your vendor today. Email us — we're piloting BAA coverage with a small group.
Two-party consent disclosures are your responsibility on the contact center side ("this call may be recorded"). Once you've handled disclosure at the IVR, the recording can be transcribed without additional consent steps. If the call was not allowed to be recorded in the first place, it should not be transcribed for QA either.
For data scrubbing, we expose an opt-out endpoint at /opt-out/{token} — you can wire it to fire after your scorecard is generated so transcripts don't persist on our side beyond what you need.
If you operate in the EU, treat the recording and the transcript as equally GDPR-sensitive: both contain personal data. Your retention policy on recordings should govern the transcript too, with the same TTL and the same deletion path.
Cost math at 10,000 calls/week
Rough numbers, so you can sanity-check ROI before you pilot.
- Average support call: 6 minutes.
- 10,000 calls/week × 6 min = 60,000 minutes/week ≈ 260,000 minutes/month (at 52 weeks per year).
- At 1 credit/minute (as of May 2026), this is well above the Business plan's 2,500-minute bundle — talk to us for volume pricing.
- Compare to: one QA analyst at $25/hour scoring 8 minutes/call × 10,000 calls/week = 1,333 analyst-hours/week ≈ 33 FTEs of QA capacity to score 100% manually.
The transcription cost is roughly two orders of magnitude below the manual-scoring cost it replaces. Reviewer headcount doesn't go to zero, because you still need humans on flagged calls and on the soft rubric lines, but the ratio inverts. For comparison, human transcription at Rev.com's ~$1.50/minute would cost about $90,000/week for 60,000 minutes, more than the roughly $33,000/week of analyst time it would replace. That's the cost ceiling AI transcription breaks through.
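The comparison can be reproduced with the figures quoted in this article. This is a back-of-envelope sketch: the 40-hour week behind the FTE count is an assumption, the per-call AI cost is the $0.01–0.02 range from the introduction, and the function name is illustrative.

```python
def weekly_costs(calls_per_week=10_000, avg_call_min=6, scoring_min_per_call=8,
                 analyst_rate=25.0, hours_per_fte=40,
                 ai_cost_per_call=(0.01, 0.02), human_rate_per_min=1.50):
    """Weekly cost of manual scoring vs AI transcription vs human transcription."""
    audio_minutes = calls_per_week * avg_call_min
    analyst_hours = calls_per_week * scoring_min_per_call / 60
    return {
        "audio_minutes": audio_minutes,
        "analyst_hours": analyst_hours,
        "analyst_ftes": analyst_hours / hours_per_fte,   # assumes a 40-hour week
        "manual_cost": analyst_hours * analyst_rate,
        "ai_cost_low": calls_per_week * ai_cost_per_call[0],
        "ai_cost_high": calls_per_week * ai_cost_per_call[1],
        "human_transcription_cost": audio_minutes * human_rate_per_min,
    }

costs = weekly_costs()
# Ratio of manual scoring cost to the upper AI transcription estimate.
ratio_vs_ai = costs["manual_cost"] / costs["ai_cost_high"]
```

With the defaults, manual scoring comes to about $33,000 per week against $100–200 for transcription, a ratio of well over 100 even at the upper estimate. Swap in your own call volume and hourly rate before using the result in a business case.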
What next
- Pull one week of recordings from one queue; even 500 calls is enough to see the pattern. Before committing to the full batch, run about ten calls (roughly 60 minutes of audio, which fits a 60-minute Free plan upload) from the worst-quality queue you have, just to feel the WER on your actual audio.
- Check your dual-channel recording setting today. If it's off, flip it for new calls. Diarization quality on stereo vs mono is the single biggest accuracy variable in your pipeline.
- Build a narrow first scorecard: 5 deterministic lines (greeting, disclosure, hold etiquette, callback offer, closing) and 5 soft lines (such as empathy, ownership, and tone). Automate the deterministic five with pass/review/unknown outcomes. Leave the soft five to humans. Measure reviewer time and unknown-rate per queue before and after.
- If you need a HIPAA BAA, email us before you build — we'll tell you honestly where we are in the pilot.
— Transcription.Solutions Team