Speaker labels in transcription. Who said what, and when.

Speaker diarization splits a multi-speaker recording into labelled turns (Speaker 1, Speaker 2) with timestamps. Rename the labels to real names and you get a citable, speaker-attributed transcript.

Drop a file, or pick one

MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously

Paste a link, we’ll fetch the audio

YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more

Record straight from your browser

Sign-up takes 30 seconds. Recording opens right after, in the dashboard.

No card required | ~90s per 60-min file | SRT · VTT · DOCX · TXT | Files auto-deleted in 24h

↓ Four voices, four labels

Multi-speaker in. Speaker-labelled out.

Drop a panel recording, an interview, or a Zoom export — the model splits voices into labelled turns with timestamps. Channel-split for stereo (each speaker on a separate track), pyannote for mono recordings where everyone shares one mic.

Panel · 4 voices · mono | REC 14:32.06
en-US auto-detected | 44.1 kHz mono | ~90s
Transcript · diarized | 4 speakers · 62:11
S1

The thing about diarization is everybody wants it to be one number, but it's really four separate problems stacked.

S2

Detection, attribution, overlap handling, and label persistence across breaks. Different failure modes on each.

S1

Right. So when we say '95% accuracy' on diarization, it depends on which

94%+ speaker attribution accuracy | SRT · DOCX · TXT · JSON · per-speaker

↓ This is the Speakers tab

Click any speaker. Filter to their turns.

Rename Speaker 1 → Mary Chen, Speaker 2 → David Park. The chip names propagate across the entire transcript, summary, and exports. Filter the transcript to one speaker — useful for journalists pulling quotes from a specific source.

Try diarization on a panel recording

Three ways to separate speakers · honest comparison

Channel-split only, AI integrated diarization, or human relabeling.

Three real ways to get speaker labels in a transcript in 2026. Channel-split is exact but only works if you record stereo. AI integrated handles mono recordings. Human is the legal-grade fallback.

Option 01

Channel-split only

Works only on stereo recordings where each speaker is on a separate audio channel — Zoom/Meet exports, two-mic studio setups.

Accuracy · stereo: 100%
Works on mono: No
Max speakers: 2 (per stereo pair)
Overlap detection: Per-channel only
Setup: Requires stereo source
Cost: Free / built-in
Best for: Two-person Zoom recordings · two-mic studio podcast · interview rigs where you control the recording setup.

Option 02

AI integrated diarization

Stereo files use exact channel-split. Mono uses pyannote-3.1 clustering. Same dashboard, same export, same speaker chips regardless of source.

Accuracy · 1–3 speakers: ≥94%
Accuracy · 4–5 speakers: ≥90%
Works on mono: Yes
Max speakers: Up to 12
Overlap detection: Yes
Cost · per min: $0.03 (Pro)
Best for: Panel discussions · multi-speaker interviews · field-recorded audio · podcast guest interviews · meeting room recordings · oral history archives.

Option 03

Human relabeling

A person listens to the recording and types speaker labels by hand. Highest accuracy on overlap and label persistence across long files.

Accuracy · any setup: 98–99%
Works on mono: Yes
Max speakers: Unlimited
Overlap detection: Best in class
60-min file: 3–6 hours
Cost · per min: $1–4
Best for: Court depositions · medical case conferences · academic qualitative research where IRB requires verified attribution · oral history with archival stakes.

Channel-split accuracy from first principles (separate channels are deterministic). AI diarization from pyannote-3.1 published benchmarks. Human relabeling rates from US/UK industry rate cards.
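
To put the per-minute rates above side by side for one file, here is a small arithmetic sketch. It uses only the figures quoted in the comparison ($0.03 per minute on Pro, $1–4 per minute for human relabeling); the function name is made up for illustration, and real invoices may differ by plan or vendor.

```python
def cost_range_usd(minutes, low_cents_per_min, high_cents_per_min):
    """Return (low, high) cost in dollars for a recording of the given length.

    Rates are passed in cents so the multiplication stays in integers.
    """
    return (minutes * low_cents_per_min / 100, minutes * high_cents_per_min / 100)


# A 60-minute recording, using the rates quoted in the comparison above.
ai_cost = cost_range_usd(60, 3, 3)          # $0.03 per minute
human_cost = cost_range_usd(60, 100, 400)   # $1 to $4 per minute

print(f"AI diarization: ${ai_cost[0]:.2f}")
print(f"Human relabeling: ${human_cost[0]:.2f} to ${human_cost[1]:.2f}")
```

At these rates a one-hour file costs $1.80 with AI diarization and $60 to $240 with human relabeling.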

Accuracy · real-world numbers

94%+ on 1–3 speakers. Honest numbers for larger panels too.

Speaker diarization accuracy depends mostly on speaker count and overlap frequency, not language or microphone (those affect transcription itself, not the speaker-label layer). Numbers below come from our internal QA on real customer recordings across 2025.

98%+
Stereo channel-split

Each speaker recorded on a separate audio channel — Zoom/Meet exports, two-mic studio podcasts. Diarization is deterministic at this point.

94%+
1–3 speakers · clean mono

The sweet spot for most professional work. Distinct voices, low overlap, good microphone distance. Usable without a review pass.

91%
4–5 speakers · clean mono

Standard panel scenario. Plan a 1-minute rename pass on the speaker chips after the job finishes; accuracy bumps to 95%+ post-rename.

89%
6+ speakers or high overlap

Conference panels, debate recordings, group brainstorms. Diarization still works but expect 2–3 chip merges that need a manual fix during review.

Common questions

8 things people ask about this.

01 · Is diarization automatic or do I have to enable it?
Automatic on every Pro and Business upload. Diarization runs alongside transcription with no toggle or separate configuration — the transcript arrives with speaker labels already applied. Free plan returns the transcript without diarization; upgrade to Pro to get speaker labels on every job.
02 · How many speakers does it support?
1–5 speakers. The system runs cleanly at 1–3 (85% confidence) and warns internally at 4–5 (70% confidence). Beyond 5 speakers, plan a manual speaker-correction pass.
03 · Can I rename Speaker 1 to the person's actual name?
Yes. Click the speaker label in the transcript viewer, type the real name, and every turn in the transcript updates. Rename is per-job. Recurring speakers can be saved to a speaker library — the rename auto-applies on the next upload.
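
If you work with an exported transcript outside the dashboard, the same rename-and-filter idea can be sketched in a few lines. This is an illustration only: the `Turn` structure and both function names are assumptions, not the product's export schema.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # diarization label, e.g. "Speaker 1"
    start: float   # turn start time in seconds
    text: str


def rename_speakers(turns, names):
    """Return new turns with labels replaced; labels missing from `names` are kept."""
    return [Turn(names.get(t.speaker, t.speaker), t.start, t.text) for t in turns]


def turns_for(turns, speaker):
    """Filter the transcript down to one speaker's turns."""
    return [t for t in turns if t.speaker == speaker]


turns = [
    Turn("Speaker 1", 0.0, "Welcome back."),
    Turn("Speaker 2", 2.4, "Thanks for having me."),
    Turn("Speaker 1", 4.1, "Let's start with the panel."),
]
renamed = rename_speakers(turns, {"Speaker 1": "Mary Chen", "Speaker 2": "David Park"})
mary_only = turns_for(renamed, "Mary Chen")
```

Because the mapping is applied to every turn at once, a single rename reaches the whole transcript, which mirrors how the speaker chips behave in the viewer.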
04 · Does diarization work for non-English recordings?
Yes. Speaker separation is acoustic — it doesn't depend on the language being spoken. Word-level transcription accuracy varies by language tier; speaker attribution accuracy is language-independent.
05 · How does diarization work technically?
The audio is segmented into voice activity regions, then each region is embedded as an acoustic fingerprint. A clustering step groups similar embeddings into speaker IDs. This is distinct from voice print identification — we don't cross-reference against a database of known voices. We determine who spoke when, not who the person is.
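
The clustering step can be illustrated with a toy example. The sketch below is not pyannote's algorithm or our production pipeline: it assumes each voice region has already been turned into a small embedding vector (real embeddings have hundreds of dimensions) and uses a simple greedy cosine-similarity rule with an arbitrary 0.8 threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.8):
    """Greedily assign each segment to the most similar existing speaker.

    A new speaker ID is opened when no existing cluster reaches `threshold`.
    """
    cluster_sums = []  # running vector sum per speaker (same direction as the mean)
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, total in enumerate(cluster_sums):
            sim = cosine(emb, total)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            cluster_sums.append(list(emb))
            best = len(cluster_sums) - 1
        else:
            cluster_sums[best] = [x + y for x, y in zip(cluster_sums[best], emb)]
        labels.append(f"SPEAKER_{best + 1}")
    return labels


# Hypothetical 3-dimensional embeddings for five voice regions.
segments = [
    [0.90, 0.10, 0.00],
    [0.10, 0.90, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.20, 0.95],
    [0.12, 0.88, 0.08],
]
labels = cluster_segments(segments)
print(labels)
```

Note that the labels only say which segments sound alike. Nothing in this step compares a voice against known identities, which is the difference between diarization and voice print identification.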
06 · What happens if two speakers have very similar voices?
Highly similar voice profiles — identical twins, two people with the same pitch and regional accent — may get merged into one speaker ID. This is the genuine hard case for all acoustic diarization systems. The workaround: record each participant on a separate audio channel and transcribe each track independently.
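
If you already have a stereo recording with one participant per channel, the workaround can be sketched with Python's standard `wave` module. This assumes an uncompressed PCM WAV file; the function names and file paths are illustrative, and other formats would need converting first.

```python
import wave


def deinterleave(frames, sampwidth):
    """Split interleaved stereo PCM bytes into (left, right) mono byte strings."""
    step = 2 * sampwidth  # one stereo frame = one left sample + one right sample
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), step):
        left += frames[i:i + sampwidth]
        right += frames[i + sampwidth:i + step]
    return bytes(left), bytes(right)


def split_stereo_wav(src_path, left_path, right_path):
    """Write each channel of a stereo PCM WAV file to its own mono WAV file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo (2-channel) WAV file")
        sampwidth = src.getsampwidth()
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())

    left, right = deinterleave(frames, sampwidth)
    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(sampwidth)
            out.setframerate(framerate)
            out.writeframes(data)
```

Calling `split_stereo_wav("interview.wav", "host.wav", "guest.wav")` would produce two mono files that can each be transcribed on their own, so similar-sounding voices never have to be told apart acoustically.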
07 · Can I get diarization results via the API?
Yes, on Pro and Business API keys. Pass diarize=true on file upload or diarize: true on a URL job. The completed transcript response includes a speaker field on each utterance object. Speaker IDs are stable within a job (SPEAKER_1 always refers to the same person) but not across jobs.
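
Once a job completes, grouping the response by speaker is straightforward. The sketch below assumes a response shape: only the `speaker` field on each utterance comes from the description above, while the `utterances`, `start`, and `text` keys and the sample payload are hypothetical placeholders. Check the API reference for the actual schema.

```python
import json
from collections import defaultdict

# Hypothetical completed-transcript payload; key names other than "speaker" are assumed.
sample_response = json.loads("""
{
  "utterances": [
    {"speaker": "SPEAKER_1", "start": 0.0, "text": "Shall we begin?"},
    {"speaker": "SPEAKER_2", "start": 1.8, "text": "Yes, go ahead."},
    {"speaker": "SPEAKER_1", "start": 3.2, "text": "First item on the agenda."}
  ]
}
""")


def text_by_speaker(response):
    """Collect utterance text per speaker ID, preserving utterance order."""
    grouped = defaultdict(list)
    for utterance in response.get("utterances", []):
        grouped[utterance.get("speaker", "UNKNOWN")].append(utterance["text"])
    return dict(grouped)


grouped = text_by_speaker(sample_response)
for speaker, lines in grouped.items():
    print(speaker, "|", " ".join(lines))
```

Since speaker IDs are stable only within a job, keep any ID-to-name mapping alongside the job it came from rather than reusing it across uploads.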
08 · Does diarization cost extra?
No per-minute surcharge. Diarization is included on Pro ($19/mo) and Business ($49/mo) at no additional charge per minute. Free plan does not include diarization — the transcript comes back as a single speaker stream.

Drop something in. See what comes out.

Diarization is included on Pro ($19/mo) and Business ($49/mo). Free plan transcribes the same audio without speaker labels — useful to evaluate raw transcription accuracy first.

Start free transcription