Speaker labels in transcription — diarization, identification, timestamped turns

Speaker labels in transcription. Who said what, and when.

Speaker diarization splits a multi-speaker recording into labelled turns — Speaker 1, Speaker 2 — with timestamps. Rename to real names, get a citable, speaker-attributed transcript.

Drop your audio or video

MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously

Paste a link, we’ll fetch the audio

YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more

Record straight from your browser

No card required · ~90s per 60-min file · SRT · VTT · DOCX · TXT · Files auto-delete in 24h

Multi-speaker in. Speaker-labelled out.

Drop a panel recording, an interview, or a Zoom export, and the model splits voices into labelled turns with timestamps. Stereo files with each speaker on a separate track are channel-split; mono recordings where everyone shares one mic go through clustering diarization.

Panel · 4 voices · mono · REC 14:32.06

en-US auto-detected · 44.1 kHz mono

~90s

Transcript · diarized · 4 speakers · 62:11

The thing about diarization is everybody wants it to be one number, but it's really four separate problems stacked.

Detection, attribution, overlap handling, and label persistence across breaks. Different failure modes on each.

Right. So when we say '95% accuracy' on diarization, it depends on which

94%+ speaker attribution accuracy · SRT · DOCX · TXT · JSON · per-speaker

Click any speaker. Filter to their turns.

Rename Speaker 1 → Mary Chen, Speaker 2 → David Park. The chip names propagate across the entire transcript, summary, and exports. Filter the transcript to one speaker — useful for journalists pulling quotes from a specific source.
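The filter-then-export step can be sketched in a few lines of Python. This is an illustration only, not the product's exporter: the turn dictionaries and their `speaker`, `start`, `end`, and `text` keys are assumed names, and `srt_timestamp` and `speaker_srt` are hypothetical helpers. The cue layout follows the common SubRip (SRT) convention.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def speaker_srt(turns, speaker):
    """Keep only one speaker's turns and number them as SRT cues."""
    selected = [t for t in turns if t["speaker"] == speaker]
    cues = []
    for index, turn in enumerate(selected, start=1):
        start = srt_timestamp(turn["start"])
        end = srt_timestamp(turn["end"])
        cues.append(f"{index}\n{start} --> {end}\n{turn['text']}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Assumed turn shape, already renamed from Speaker 1 / Speaker 2.
turns = [
    {"speaker": "Mary Chen", "start": 0.0, "end": 2.5, "text": "Hello."},
    {"speaker": "David Park", "start": 2.5, "end": 5.0, "text": "Hi."},
    {"speaker": "Mary Chen", "start": 5.0, "end": 7.0, "text": "Shall we?"},
]
print(speaker_srt(turns, "Mary Chen"))
```

Filtering before numbering keeps the cue indices contiguous within each per-speaker file, which is what most subtitle editors expect.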

app.transcription.solutions / panel-conference-q3-keynote.mp3

panel-conference-q3-keynote.mp3 · 62:11 · 192 kbps · 44.1 kHz mono · 4 speakers · en-US auto-detected

Speaker diarization splits one audio file into labelled turns; rename them once and the labels propagate everywhere.

Sample preview from a 4-speaker panel discussion about audio engineering tradeoffs. The Speakers tab lets you rename labels, filter the transcript to one voice, and export per-speaker SRT files, a workflow used by journalists, qualitative researchers, and podcast editors.

Key points

1–3 speakers: ≥94% attribution accuracy on clean audio. The cleanest case.

4–5 speakers: ≥90% with a 1-minute rename pass on the speaker chips. Standard panel scenario.

Stereo recordings: channel-split is exact, with each speaker on a known track. Zoom and Meet recordings work this way.

Mono recordings: with everyone sharing one mic, a clustering model is used.

Overlap detection: the hardest case. Both speakers are labelled, but the overlapping words may be wrong.

Action items

Speaker 1: Rename Speakers 1 through 4 with the actual panel names.

Speaker 2: Filter the transcript to Speaker 3 (the customer-side voice) for pull-quotes.

Speaker 1: Export per-speaker SRT for the social-clip cuts.

Speaker 2: Flag overlap moments at 14:32, 28:11, 47:08 for manual review.

Auto-tagged: panel discussion · 4-speaker diarization · audio engineering · rename + filter workflow

Try diarization on a panel recording

Option 01

Channel-split only

Works only on stereo recordings where each speaker is on a separate audio channel — Zoom/Meet exports, two-mic studio setups.

Accuracy · stereo: 100%

Works on mono: No

Max speakers: 2 (per stereo pair)

Overlap detection: Per-channel only

Setup: Requires stereo source

Cost: Free / built-in

Best for: Two-person Zoom recordings · two-mic studio podcast · interview rigs where you control the recording setup.
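Channel-split is deterministic because stereo PCM stores samples interleaved (left, right, left, right), so separating the speakers is a matter of taking every other sample. The sketch below shows the idea for 16-bit PCM using only the Python standard library. It is an illustration under stated assumptions, not the product's implementation, and `split_stereo_pcm16` is a hypothetical helper name.

```python
import array


def split_stereo_pcm16(interleaved: bytes) -> tuple[bytes, bytes]:
    """De-interleave 16-bit stereo PCM (L R L R ...) into two mono streams."""
    samples = array.array("h")          # signed 16-bit samples
    samples.frombytes(interleaved)
    if len(samples) % 2 != 0:
        raise ValueError("stereo PCM must hold an even number of samples")
    left = samples[0::2]                # even positions: channel 1
    right = samples[1::2]               # odd positions: channel 2
    return left.tobytes(), right.tobytes()


# Three stereo frames: speaker A on the left, speaker B on the right.
frames = array.array("h", [1, -1, 2, -2, 3, -3]).tobytes()
left, right = split_stereo_pcm16(frames)
print(len(left), len(right))
```

In practice the interleaved bytes would come from something like `wave.open(path, "rb").readframes(n)` on a two-channel, 16-bit WAV file; other sample widths would need a different `array` type code.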

Option 02

AI integrated diarization

Stereo files use exact channel-split. Mono uses acoustic clustering. Same dashboard, same export, same speaker chips regardless of source.

Accuracy · 1–3 speakers: ≥94%

Accuracy · 4–5 speakers: ≥90%

Works on mono: Yes

Max speakers: Up to 12

Overlap detection: Yes

Cost · per min: $0.03 (Pro)

Best for: Panel discussions · multi-speaker interviews · field-recorded audio · podcast guest interviews · meeting room recordings · oral history archives.

Option 03

Human relabeling

A person listens to the recording and types speaker labels by hand. Highest accuracy on overlap and label persistence across long files.

Accuracy · any setup: 98–99%

Works on mono: Yes

Max speakers: Unlimited

Overlap detection: Best in class

60-min file: 3–6 hours

Cost · per min: $1–4

Best for: Court depositions · medical case conferences · academic qualitative research where IRB requires verified attribution · oral history with archival stakes.

Sources: channel-split accuracy follows from first principles (separate channels are deterministic); AI diarization figures come from published clustering-model benchmarks; human relabeling rates come from US/UK industry rate cards.
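For a rough sense of scale, the per-minute figures in the comparison can be multiplied out for a 60-minute file. The inputs are the numbers quoted on this page; the calculation assumes flat per-minute pricing with no minimums or plan fees, which is an assumption rather than a stated pricing rule.

```python
MINUTES = 60

ai_cost = MINUTES * 0.03          # Option 02, quoted Pro rate per minute
human_low = MINUTES * 1           # Option 03, low end of $1-4 per minute
human_high = MINUTES * 4          # Option 03, high end

ai_seconds = 90                   # "~90s per 60-min file"
human_seconds_low = 3 * 3600      # "3-6 hours" for a 60-min file
human_seconds_high = 6 * 3600

print(f"AI diarization: ${ai_cost:.2f}")
print(f"Human relabeling: ${human_low}-${human_high}")
print(
    "Turnaround ratio: "
    f"{human_seconds_low // ai_seconds}x-{human_seconds_high // ai_seconds}x"
)
```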

8 things people ask about this.

01. Is diarization automatic or do I have to enable it?

Automatic on every Pro and Business upload. Diarization runs alongside transcription with no toggle or separate configuration — the transcript arrives with speaker labels already applied. Free plan returns the transcript without diarization; upgrade to Pro to get speaker labels on every job.

02. How many speakers does it support?

1–5 speakers. The system runs cleanly at 1–3 (85% confidence) and warns internally at 4–5 (70% confidence). Beyond 5 speakers, plan a manual speaker-correction pass.

03. Can I rename Speaker 1 to the person's actual name?

Yes. Click the speaker label in the transcript viewer, type the real name, and every turn in the transcript updates. Rename is per-job. Recurring speakers can be saved to a speaker library — the rename auto-applies on the next upload.

04. Does diarization work for non-English recordings?

Yes. Speaker separation is acoustic — it doesn't depend on the language being spoken. Word-level transcription accuracy varies by language tier; speaker attribution accuracy is language-independent.

05. How does diarization work technically?

The audio is segmented into voice activity regions, then each region is embedded as an acoustic fingerprint. A clustering step groups similar embeddings into speaker IDs. This is distinct from voice print identification — we don't cross-reference against a database of known voices. We determine who spoke when, not who the person is.
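The clustering step can be pictured with a toy example. The sketch below is not the product's model: it uses a simple greedy centroid clustering with cosine similarity, hand-written 2-D vectors in place of real acoustic embeddings, and an arbitrary similarity threshold of 0.8. All names (`cosine`, `assign_speakers`) are hypothetical. Note that it only groups segments by similarity; nothing is compared against a database of known voices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar cluster or start a new one."""
    centroids = []   # running mean embedding per speaker
    counts = []      # number of segments merged into each centroid
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            n = counts[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)
            ]
            counts[best] = n + 1
        labels.append(f"SPEAKER_{best + 1}")
    return labels


# One toy embedding per voice-activity region, in time order.
regions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [1.0, 0.05]]
print(assign_speakers(regions))
```

Two voices whose embeddings sit above the threshold relative to each other would collapse into one label here, which mirrors the similar-voices failure mode discussed in the next question.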

06. What happens if two speakers have very similar voices?

Highly similar voice profiles — identical twins, two people with the same pitch and regional accent — may get merged into one speaker ID. This is the genuine hard case for all acoustic diarization systems. The workaround: record each participant on a separate audio channel and transcribe each track independently.

07. Can I get diarization results via the API?

Yes, on Pro and Business API keys. Pass diarize=true on file upload or diarize: true on a URL job. The completed transcript response includes a speaker field on each utterance object. Speaker IDs are stable within a job (SPEAKER_1 always refers to the same person) but not across jobs.
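As a sketch of what consuming such a response might look like, the Python below groups utterances by speaker and applies a per-job rename map. Only the per-utterance `speaker` field and the `SPEAKER_1` label style come from the answer above; the `utterances`, `start`, `end`, and `text` names, and the overall response shape, are assumptions made for illustration. No endpoint is called.

```python
from collections import defaultdict

# Hypothetical completed-transcript response. Only "speaker" is documented
# above; every other key name here is an assumption.
response = {
    "utterances": [
        {"speaker": "SPEAKER_1", "start": 0.0, "end": 4.2,
         "text": "Welcome to the panel."},
        {"speaker": "SPEAKER_2", "start": 4.2, "end": 9.8,
         "text": "Thanks for having me."},
        {"speaker": "SPEAKER_1", "start": 9.8, "end": 12.5,
         "text": "Let's get started."},
    ]
}


def turns_by_speaker(resp, names=None):
    """Group utterance text by speaker, applying an optional rename map."""
    names = names or {}
    grouped = defaultdict(list)
    for utterance in resp["utterances"]:
        label = names.get(utterance["speaker"], utterance["speaker"])
        grouped[label].append(utterance["text"])
    return dict(grouped)


# Keep the rename map per job: SPEAKER_1 in another job may be someone else.
job_names = {"SPEAKER_1": "Mary Chen"}
print(turns_by_speaker(response, job_names))
```

Because speaker IDs are not stable across jobs, storing the rename map alongside the job ID is safer than reusing one global mapping.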

08. Does diarization cost extra?

No per-minute surcharge. Diarization is included on Pro ($19/mo) and Business ($49/mo) at no additional charge per minute. Free plan does not include diarization — the transcript comes back as a single speaker stream.