How to edit a transcript in 5 minutes, not 50
To edit the transcript of a 60-minute file in about 5-7 minutes, do four passes in this order: rename each speaker once (the label propagates to every turn), skim the bubble view at reading speed and click any turn that looks off to hear the source audio at that timestamp, filter to one speaker at a time to spot-check proper nouns, then export. At ~92% accuracy you're looking for roughly 720 wrong words in a 9,000-word transcript (8% of 9,000), usually clustered on names, jargon, and crosstalk, not scattered randomly.
Why a 92% transcript still needs editing
A 60-minute conversation produces roughly 9,000 words. At ~92% accuracy on real-world podcast audio at 128 kbps or higher, that's 920 correct words for every 80 wrong ones, so around 720 errors in the full file. The number sounds bad until you look at where the errors actually land.
In practice, errors cluster. The model gets common verbs and connectives almost perfectly. What it misses: proper nouns (people, products, places it hasn't seen), domain jargon, numbers spoken quickly, and overlapping speech where two voices cross. A reviewer who scans for those four categories — instead of reading every word — finishes a 60-minute transcript in the time it takes to drink a coffee.
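The word-count arithmetic behind an accuracy figure is easy to check. The sketch below assumes accuracy is simply 1 minus word error rate (WER) and that a 60-minute file holds about 9,000 words. The function names are illustrative and not part of any product API.

```python
def expected_wrong_words(word_count: int, accuracy: float) -> int:
    """Estimate wrong words, treating accuracy as 1 - WER (an assumption)."""
    return round(word_count * (1.0 - accuracy))


def words_per_error(accuracy: float) -> float:
    """Average number of words between errors if errors were spread evenly."""
    return 1.0 / (1.0 - accuracy)


# A 60-minute conversation at roughly 9,000 words and ~92% accuracy.
wrong = expected_wrong_words(9_000, 0.92)
spacing = words_per_error(0.92)
print(f"{wrong} expected wrong words, about one every {spacing:.1f} words")
```

Real errors are not spread evenly, so the even-spacing figure overstates how often you need to stop while skimming.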
The pattern below assumes you're working in the Transcription.Solutions reader, which gives you a bubble layout, click-to-audio on every turn, and a speaker filter that isolates one voice at a time. The same workflow works in any tool with those three primitives — but the timings are based on ours.
The 5-minute review pass — four steps
Step 1 — Rename the speakers once (30 seconds)
The transcript opens with anonymous labels: Speaker 1, Speaker 2, sometimes a third. Click the coloured chip on any turn and a popover opens with four actions, including rename. Type the real name. Every turn by that speaker, earlier or later in the transcript, updates instantly.
Do this before you read anything. It removes the cognitive overhead of mapping "Speaker 2 said..." to a person while you're reviewing. On a two-person interview this takes 20 seconds; on a four-person panel, under a minute.
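A rename can propagate instantly when each turn stores a stable speaker ID and the display name is looked up at render time. The sketch below shows that idea in miniature. The `Turn` class, the `render` helper, and the sample names are hypothetical illustrations, not the reader's actual code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker_id: str  # stable anonymous ID, e.g. "speaker_0"
    start: float     # seconds from the start of the audio
    text: str


def render(turns: list[Turn], labels: dict[str, str]) -> list[str]:
    """Resolve each turn's label at display time, falling back to the raw ID."""
    return [f"{labels.get(t.speaker_id, t.speaker_id)}: {t.text}" for t in turns]


turns = [
    Turn("speaker_0", 0.0, "Welcome to the show."),
    Turn("speaker_1", 2.4, "Thanks for having me."),
    Turn("speaker_0", 4.1, "Let's get started."),
]
labels = {"speaker_0": "Speaker 1", "speaker_1": "Speaker 2"}

# Renaming is a single write to the label map; no turn is edited.
labels["speaker_0"] = "Dana"
print("\n".join(render(turns, labels)))
```

Because the turns never hold the display name, the cost of a rename does not grow with the number of turns.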
Step 2 — Skim the bubble view, click to verify (2-3 minutes)
Read the transcript at skim speed — fast enough that you're processing meaning, not words. When something reads oddly (a name you don't recognise, a number that doesn't fit, a sentence that breaks grammar), click that turn. The audio player jumps to that exact timestamp and plays.
You're not listening to the whole thing. You're checking one second of audio against one bubble of text. Fix the text inline, click the next questionable turn. On a 60-minute file with average audio quality, this surfaces maybe 30-50 spots worth checking, of which 20-30 need an edit.
Step 3 — Filter to one speaker, spot-check proper nouns (1-2 minutes)
Click a speaker chip and choose filter to this speaker. The view collapses to that person's turns only. Read just their lines: your brain holds their vocabulary in working memory, so jargon repetition becomes obvious. A guest who says "Postgres" twelve times will have it transcribed as "post grass" or "postgrad" the one time they mumbled. You'll spot it because every other instance is correct.
Repeat for the second speaker. If your file has more than two speakers, filter only the ones who say technical or proper-noun-heavy content — the host who mostly says "right, sure, go on" doesn't need a pass.
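If you have an exported transcript and want a second opinion on Step 3, the same "every other instance is correct" check can be approximated in a script. This is a rough sketch using Python's standard `difflib`, not a feature of the reader. The `(speaker_id, text)` tuple shape, the function name, and the 0.7 similarity threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def near_misses(turns, speaker_id, term, threshold=0.7):
    """Find words or adjacent word pairs that resemble `term` but are not it.

    `turns` is a list of (speaker_id, text) tuples (a hypothetical shape).
    Returns (turn_index, matched_text) pairs for one speaker only.
    """
    target = term.lower()
    hits = []
    for index, (sid, text) in enumerate(turns):
        if sid != speaker_id:
            continue  # mirror the filter-to-speaker view
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Candidates: every single word, then every adjacent pair.
        candidates = [(w,) for w in words] + list(zip(words, words[1:]))
        for candidate in candidates:
            if target in candidate:
                continue  # already the correct spelling
            joined = "".join(candidate)
            if SequenceMatcher(None, joined, target).ratio() >= threshold:
                hits.append((index, " ".join(candidate)))
    return hits


turns = [
    ("speaker_1", "We moved everything to Postgres last year."),
    ("speaker_0", "Right, go on."),
    ("speaker_1", "Honestly post grass handles it fine."),
    ("speaker_1", "Postgres is boring, and boring is good."),
]
print(near_misses(turns, "speaker_1", "Postgres"))
```

A lower threshold catches more mumbled variants but also flags more innocent words, so treat the output as a list of turns to click and verify, not as automatic corrections.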
Step 4 — Export (10 seconds)
Pick the format your downstream tool wants: DOCX for editorial review, SRT/VTT for video subtitles, TXT for further AI processing, JSON if you're piping into something custom. All five export formats are available on every plan, including the free tier.
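For a sense of what the subtitle formats contain, here is a minimal sketch that builds SRT cues from timed text. SRT numbers each cue, writes times as `HH:MM:SS,mmm`, and separates cues with a blank line. The `(start, end, text)` tuple shape and the function names are assumptions for illustration; the product's own exporter may structure cues differently.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as numbered SRT blocks."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


print(to_srt([(0.0, 2.5, "Hello."), (2.5, 4.0, "Hi.")]))
```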
What the bubble + popover layout actually looks like
The reader is a vertical scroll of bubbles — one per turn — coloured by speaker (up to five distinct chip colours). For a single-speaker file (solo podcast, voice memo, lecture), the layout switches to display-quote sizing with no chips, because there's nothing to distinguish.
There is no horizontal timeline view, no multi-track audio editor, no waveform scrubber per speaker. The bubble + click-to-audio + filter pattern handles the editing job without any of that. If you've used a DAW-style transcription tool before and bounced off the complexity, this is the opposite end of that spectrum.
When the 5-minute pass isn't enough
Three audio conditions push the review time longer:
- Phone-quality or 8 kHz audio. Accuracy drops to ~82% (about 17.7% WER per AssemblyAI's published benchmarks). Expect 15-20 minutes of review per hour of audio. For call-centre or telephony recordings, plan for it.
- Heavy domain jargon with no glossary. Medical, legal, and academic recordings have proper-noun density that defeats generic ASR. Budget 10-15 minutes review per hour and keep a find-and-replace list handy.
- Three or more speakers with frequent crosstalk. Speaker diarization handles cleanly separated turns well, but when two voices overlap the model assigns the overlap to one speaker. Manual fixes on those segments take time the bubble UI can't shortcut.
For studio-grade audio at 192 kbps or higher with two clearly separated speakers, you can often skip Step 3 entirely and finish in under 4 minutes.
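For jargon-heavy recordings, the find-and-replace list can live in a small script that you run on an exported TXT file. The sketch below is one way to do that under stated assumptions: the glossary entries are made-up examples, and `apply_glossary` is a hypothetical helper, not a product feature.

```python
import re


def apply_glossary(text: str, glossary: dict[str, str]) -> str:
    """Replace known mis-transcriptions with the correct term.

    Matches whole words or phrases only, ignoring case, so "postgrad"
    is replaced but "postgraduate" is left alone.
    """
    for wrong, right in glossary.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function replacement keeps backslashes in `right` literal.
        text = pattern.sub(lambda _match, right=right: right, text)
    return text


glossary = {"post grass": "Postgres", "postgrad": "Postgres"}
print(apply_glossary("We run post grass and Postgrad in prod.", glossary))
```

Blind replacement is only safe for terms that cannot be legitimate in the recording, so keep the list short and specific to each show or client.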
Mono vs stereo recordings — a hidden speed factor
Stereo recordings where each speaker has their own channel (most remote podcast setups using Riverside, Zencastr, SquadCast) get channel-split diarization — speaker assignment is 100% accurate because it's not inference, it's just "this channel = this person." You skip nearly all of Step 3.
Mono recordings — phone audio, single-mic field interviews, downloaded YouTube videos — run through pyannote/speaker-diarization-3.1, which is good but not perfect. Expect to merge or split a few turns where the model put two consecutive sentences under different speakers, or vice versa. The SpeakerPopover handles this in two clicks per fix.
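The channel-split case needs no inference because the speaker is known before any text exists. A minimal sketch, assuming each channel has already been transcribed separately into `(start_seconds, text)` segments; the data shape, the names, and the function are hypothetical:

```python
def merge_channels(
    channels: dict[str, list[tuple[float, str]]],
) -> list[tuple[float, str, str]]:
    """Interleave per-channel segments into one transcript ordered by start time.

    The speaker comes from the channel key, so no model has to guess it.
    """
    turns = [
        (start, speaker, text)
        for speaker, segments in channels.items()
        for start, text in segments
    ]
    return sorted(turns, key=lambda turn: turn[0])


channels = {
    "Host": [(0.0, "Welcome back."), (6.2, "Tell me more.")],
    "Guest": [(2.1, "Glad to be here.")],
}
for start, speaker, text in merge_channels(channels):
    print(f"[{start:5.1f}s] {speaker}: {text}")
```

On a mono file there is no channel key to read, which is why a diarization model has to infer the speaker instead.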
Realistic review times by source type
| Source | Typical accuracy | Review time per hour of audio |
|---|---|---|
| Studio podcast (stereo, 192 kbps+) | ~95-97% | 4-5 min |
| Remote interview (stereo, 128 kbps) | ~92% | 5-7 min |
| Single-mic field interview (mono) | ~88-90% | 8-12 min |
| Zoom / Meet recording (mono, compressed) | ~88% | 10-15 min |
| Phone call (8 kHz mono) | ~82% | 15-20 min |
| YouTube download | varies | 5-15 min |
These are working numbers from typical files we process, not lab conditions. Your own first review will probably take longer — building the "skim, click, verify, move on" rhythm takes one or two transcripts before it gets fast.
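To budget review time for files that are not exactly an hour long, the table can be turned into a small lookup. The sketch below assumes review time scales linearly with audio length, which is a simplification. The dictionary keys and the function name are invented for this example.

```python
# Review minutes per hour of audio (low, high), copied from the table above.
REVIEW_MINUTES_PER_HOUR = {
    "studio_stereo": (4, 5),
    "remote_stereo": (5, 7),
    "field_mono": (8, 12),
    "video_call_mono": (10, 15),
    "phone_8khz": (15, 20),
    "youtube_download": (5, 15),
}


def estimate_review_minutes(source: str, audio_minutes: float) -> tuple[float, float]:
    """Scale the per-hour range to the file length (assumes linear scaling)."""
    low, high = REVIEW_MINUTES_PER_HOUR[source]
    hours = audio_minutes / 60
    return low * hours, high * hours


low, high = estimate_review_minutes("phone_8khz", 30)
print(f"Budget {low:.1f} to {high:.1f} minutes of review")
```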
FAQ
How many errors are in a typical auto-generated transcript?
A 60-minute file produces about 9,000 transcribed words. At ~92% accuracy on real-world podcast audio at 128 kbps or higher, that's roughly 720 wrong words, or about one error every 12-13 words. Errors cluster on proper nouns, technical jargon, numbers, and overlapping speech. Connective words and common verbs are nearly perfect, so reading every word is wasted effort.
Can I edit a transcript without listening to the whole audio again?
Yes — that's the point of click-to-audio. Skim the text at reading speed, click any turn that looks suspicious, and the player jumps to that exact second. You verify a one-second clip, fix the text, move on. On a clean 60-minute file you'll spot-check 30-50 moments out of about 1,200 turns, which is roughly 5-7 minutes total.
Does renaming a speaker update every occurrence?
Yes. Speakers are stored as anonymous IDs (speaker_0, speaker_1, etc.) and rendered with a label. Renaming the label in the SpeakerPopover updates every turn by that speaker, earlier or later in the transcript, in one operation. You never have to find-and-replace names manually, even on a four-speaker recording with hundreds of turns.
What if the diarization split one person across two speaker labels?
This happens sometimes on mono recordings, especially when one person changes vocal energy mid-conversation. Filter to each of the two labels, confirm they're the same voice, then merge by renaming both to the same name — the colour chips will still differ, but the label is unified for export. For more aggressive merging, edit the JSON export directly.
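Merging in the JSON export amounts to rewriting one speaker ID to another on every affected turn. The sketch below assumes a hypothetical export shape, a top-level `turns` list whose items carry a `speaker` field. Check your own export's field names before adapting it, because the real schema may differ.

```python
import json


def merge_speakers(export_json: str, keep: str, drop: str) -> str:
    """Reassign every turn from speaker `drop` to speaker `keep`.

    Assumes a hypothetical schema: {"turns": [{"speaker": ..., "text": ...}]}.
    """
    data = json.loads(export_json)
    for turn in data["turns"]:
        if turn["speaker"] == drop:
            turn["speaker"] = keep
    return json.dumps(data, indent=2)


export_json = json.dumps({
    "turns": [
        {"speaker": "speaker_0", "start": 0.0, "text": "So what changed?"},
        {"speaker": "speaker_1", "start": 1.8, "text": "We rewrote the importer."},
        {"speaker": "speaker_2", "start": 5.3, "text": "And it was worth it!"},
    ]
})
print(merge_speakers(export_json, keep="speaker_1", drop="speaker_2"))
```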
Which export format should I use after editing?
DOCX for editorial workflows (tracked changes, comments). SRT or VTT for video subtitles you'll upload to YouTube, Vimeo, or your CMS — we don't burn subtitles into the video file itself, just deliver the timing files. TXT for piping into another AI tool. JSON if you need turn-level timestamps and speaker IDs preserved for a custom pipeline. All five are available on every plan.
Is editing faster on stereo or mono recordings?
Stereo recordings where each speaker has their own channel are dramatically faster to edit, because speaker assignment is determined by channel rather than model inference — there are no diarization errors to fix. If you record remote interviews on Riverside, Zencastr, SquadCast, or similar, request the multi-track or per-speaker stereo export. You'll skip 1-2 minutes of review on every hour-long file.
Can I edit transcripts on the free tier?
Yes. The free tier (30 minutes per month, no card) includes the full reader: bubble UI, click-to-audio, speaker rename, filter-to-speaker, and all five export formats. It's the same editing experience as Pro and Business — the difference is monthly volume and per-file size, not features. Upload one of your real files to see the workflow on your actual audio.
Related reading
- How accurate is automatic transcription, really? — what 92% means by audio type, with the underlying WER numbers
- Stereo vs mono for interview recording — why per-speaker channels save you editing time later
- Exporting SRT and VTT for video subtitles — the post-edit workflow for putting transcripts back onto video
- REST API for batch transcription — when you're processing more files than the dashboard makes sense for