How speaker diarization actually works on your file

Speaker diarization is the step that decides who spoke when and tags each turn in the transcript. Our pipeline has two paths. If your file is stereo with one person per channel, we split the channels: Speaker 0 is the left track, Speaker 1 is the right, and the assignment is exact. If your file is mono, a diarization model estimates turn boundaries from acoustic features, which is usually good but never perfect. Record stereo when you can.

What speaker diarization is, and what it isn't

Speaker diarization is the process of segmenting an audio file into turns and assigning each turn a speaker label — Speaker 0, Speaker 1, Speaker 2 — without knowing in advance who those people are. It answers who spoke when, not who they are by name. Naming happens later, when you click a speaker chip in the transcript and rename it.

It's also separate from recognition. The ASR model turns sound into words; diarization runs alongside and stamps each word with a speaker ID. The two are stitched together so the final transcript reads as a conversation, not a monologue.
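To make the stitching step concrete, here is a minimal sketch that gives each recognised word the speaker whose turn overlaps it most in time. The `Word` and `Turn` types and the `label_words` function are hypothetical names chosen for illustration; they are not our internal data model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def overlap(w: Word, t: Turn) -> float:
    """Seconds during which the word and the turn overlap (0.0 if none)."""
    return max(0.0, min(w.end, t.end) - max(w.start, t.start))


def label_words(words: List[Word], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Pair each word with the speaker of the turn that overlaps it most."""
    labelled = []
    for w in words:
        best = max(turns, key=lambda t: overlap(w, t), default=None)
        # A word that overlaps no turn at all is left unlabelled.
        speaker = best.speaker if best and overlap(w, best) > 0 else None
        labelled.append((speaker, w.text))
    return labelled


# Illustrative data only.
words = [Word("Welcome", 0.0, 0.4), Word("back", 0.4, 0.7), Word("Thanks", 0.9, 1.3)]
turns = [Turn("speaker_0", 0.0, 0.8), Turn("speaker_1", 0.8, 1.5)]
labelled = label_words(words, turns)
```

A word that straddles a turn boundary goes to whichever turn covers more of it, which is one reason a slightly misplaced boundary shows up as one or two words attributed to the wrong person.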

One thing worth flagging up front: diarization is the part of the pipeline most likely to produce a visibly wrong result. ASR errors usually look like a misheard word. Diarization errors look like Speaker 1 saying something Speaker 2 actually said — which is more annoying to fix. That's why the two paths below matter.

Path 1: stereo file → channel split

If your upload has two audio channels (left and right), and the speakers were recorded on separate channels, we don't run a diarization model at all. We split the file by channel:

Left channel → speaker_0
Right channel → speaker_1

Internally this method is logged as channel_split with 100% confidence. There's no inference, no acoustic similarity scoring, no risk of confusing two voices that happen to sound alike. If the left mic only ever picked up host audio and the right mic only ever picked up guest audio, the diarization is exact by construction.
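For illustration, a channel split can be as small as the sketch below. It assumes 16-bit PCM with samples interleaved left, right, left, right (the usual layout in a stereo WAV file); `split_stereo_pcm16` is a hypothetical helper, not the function our pipeline runs.

```python
from array import array
from typing import Tuple


def split_stereo_pcm16(frames: bytes) -> Tuple[bytes, bytes]:
    """Deinterleave 16-bit stereo PCM (L R L R ...) into two mono streams."""
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if len(samples) % 2:
        raise ValueError("stereo data must hold an even number of samples")
    left = samples[0::2]   # every even-indexed sample
    right = samples[1::2]  # every odd-indexed sample
    return left.tobytes(), right.tobytes()


# Three illustrative stereo frames.
interleaved = array("h", [100, -100, 200, -200, 300, -300]).tobytes()
left, right = split_stereo_pcm16(interleaved)
speakers = {"speaker_0": left, "speaker_1": right}
```

Nothing here looks at what the audio sounds like. The label comes purely from the sample's position in the frame, which is why the result is exact whenever each mic really did capture one person. With a real file, the `frames` bytes could come from `readframes()` in Python's standard `wave` module.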

This is the path that handles most remote-recorded podcasts. Tools like Riverside, SquadCast, Zencastr and Cleanfeed deliver each participant on their own track. When you export as a single stereo file with one speaker per channel, or upload two mono files that we mux into one, diarization costs nothing and is exact.

The catch: if you accidentally feed us a stereo file where both speakers are mixed into both channels (a recording of a Zoom call captured from the system, for instance), the channel split still runs, but Speaker 0 and Speaker 1 will both contain the same conversation at slightly different volumes. That's worse than mono. If in doubt, downmix to mono before uploading.
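If you are not sure which kind of stereo file you have, a rough check before uploading is to measure how similar the two channels are. The sketch below is one way to do that, assuming 16-bit PCM on a little-endian machine. The function names and the 0.9 threshold are assumptions for illustration, and real recordings with mic bleed will land somewhere in between the two extremes shown.

```python
from array import array
from math import sqrt


def _channels(frames: bytes):
    """Split interleaved 16-bit stereo PCM into (left, right) sample arrays."""
    samples = array("h")
    samples.frombytes(frames)
    if len(samples) % 2:
        raise ValueError("stereo data must hold an even number of samples")
    return samples[0::2], samples[1::2]


def channel_correlation(frames: bytes) -> float:
    """Pearson correlation between left and right; near 1.0 means same content."""
    left, right = _channels(frames)
    n = len(left)
    if n == 0:
        return 0.0
    mean_l, mean_r = sum(left) / n, sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    if var_l == 0 or var_r == 0:
        return 0.0  # a silent channel carries no evidence either way
    return cov / sqrt(var_l * var_r)


def downmix_to_mono(frames: bytes) -> bytes:
    """Average left and right into a single mono channel."""
    left, right = _channels(frames)
    return array("h", ((l + r) // 2 for l, r in zip(left, right))).tobytes()


# Same signal in both channels, right at half volume (illustrative samples).
shared = array("h", [1000, 500, -2000, -1000, 3000, 1500, -4000, -2000]).tobytes()
# Each channel active only while the other is silent.
separate = array("h", [1000, 0, -1000, 0, 0, 1000, 0, -1000]).tobytes()

looks_mixed = channel_correlation(shared) > 0.9  # hypothetical threshold
mono = downmix_to_mono(shared)
```

This pure-Python loop is slow on a full-length episode, so in practice you would run it on a short excerpt or use a numerical library.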

Path 2: mono file → model-based diarization

For mono files — one channel, multiple voices mixed together — we run our speaker diarization model. The audio is normalised to 16 kHz, and the model produces a sequence of turns with speaker IDs speaker_0, speaker_1, speaker_2 and onward. The method is logged as pyannote in the JSON export.

The model works in two stages. First, a segmentation network finds boundaries where the voice changes. Then, an embedding network produces a vector for each segment, and clustering groups vectors that sound like the same person. It doesn't need to know how many speakers are in the file — it figures that out — but you get better results when voices are distinct, the recording is clean, and turns don't overlap heavily.
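The second stage is easier to picture with a toy example. The sketch below groups segment embeddings by cosine similarity with a simple greedy rule; it is a simplified stand-in for the clustering idea, not the algorithm pyannote uses, and the two-dimensional vectors and the 0.8 threshold are made up for illustration.

```python
from math import sqrt
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_segments(embeddings: List[Sequence[float]], threshold: float = 0.8) -> List[str]:
    """Greedily assign each segment to the most similar known speaker.

    A segment that is not similar enough to any existing speaker starts a
    new one, so the speaker count is discovered rather than supplied.
    """
    centroids: List[List[float]] = []  # running sum of embeddings per speaker
    labels: List[str] = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            idx = sims.index(max(sims))
            centroids[idx] = [c + e for c, e in zip(centroids[idx], emb)]
        else:
            idx = len(centroids)
            centroids.append(list(emb))
        labels.append(f"speaker_{idx}")
    return labels


# Four segments from two clearly different voices (toy 2-D embeddings).
segments = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9]]
labels = cluster_segments(segments)
```

Even in this toy version, two voices whose embeddings sit close together would fall above the threshold and merge into one speaker.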

Real-world failure modes we see on this path:

Two voices with similar pitch and accent (e.g. two thirty-something men in the same regional accent) sometimes get merged into one speaker.
One person speaking from two distances (host moves between near and far positions on the same mic) sometimes gets split into two speakers.
Crosstalk — both people speaking at once for more than a second or two — produces noisy boundaries.
Background music or laughter occasionally registers as a new speaker for a few seconds.
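If you post-process transcripts in your own scripts, the last failure mode is the easiest to spot automatically: a "speaker" with only a few seconds of total talk time is often music or laughter. The sketch below flags such speakers for review. The `(speaker, start, end)` tuple shape and the 5-second cutoff are assumptions for illustration, not part of our export format.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def flag_brief_speakers(
    turns: Iterable[Tuple[str, float, float]],
    min_total_seconds: float = 5.0,  # hypothetical cutoff
) -> List[str]:
    """Return speakers whose total speaking time falls below the cutoff."""
    totals = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += end - start
    return sorted(s for s, total in totals.items() if total < min_total_seconds)


# Illustrative turns: speaker_2 appears for only 2.5 seconds.
turns = [
    ("speaker_0", 0.0, 42.0),
    ("speaker_1", 42.0, 75.0),
    ("speaker_2", 75.0, 77.5),
    ("speaker_0", 77.5, 120.0),
]
suspects = flag_brief_speakers(turns)
```

A flagged speaker is only a candidate for review; a guest who genuinely says one sentence would be flagged too.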

When pyannote gets it wrong, the fix is usually fast in the UI. Click a misattributed bubble, reassign it to the right speaker, and move on. But "usually fast" is not "never needed" — which is why the next section exists.

Why this matters for how you record

If you have any control over how the audio is captured, record stereo with one person per channel. It turns a probabilistic step into a deterministic one. A remote-recorded interview on Riverside with separate tracks will diarize perfectly. The same interview captured as a single Zoom recording will go through pyannote, and the result depends on how well the model copes with those particular voices.

Practical setups that produce per-speaker channels:

Remote podcast platforms (Riverside, SquadCast, Zencastr, Cleanfeed) — each participant records locally, you get separate tracks. Combine them into a stereo file with host on L, guest on R.
Two-mic in-person interview into a multitrack recorder (Zoom H5, H6, Tascam Portacapture) — record each mic to its own track, export as stereo.
One USB interface, two XLR mics (Focusrite Scarlett 2i2, Audient EVO 4) — pan one mic hard left, the other hard right, record stereo in your DAW.

Setups that produce mono (and therefore go through pyannote):

A single shared microphone in a room.
Phone recordings.
Zoom or Google Meet single-file recordings (these are either mono or stereo with both speakers in both channels; downmix the stereo ones to mono before uploading so they route to pyannote instead of the channel split).
Most YouTube and TikTok downloads.

You can still get a clean transcript from a mono recording — see the accuracy detail on the audio to text page — but expect to do a pass through the speaker labels in the UI before you ship.

What the result looks like in the transcript

The transcript readout renders each speaker as a coloured bubble with a chip on top — Speaker 1, Speaker 2, etc. Click a chip and a popover opens with four actions: rename the speaker (so "Speaker 1" becomes "Lena Kovács" for the whole file), filter to show only that speaker's turns, copy that speaker's text, or jump to their first turn.

Names you set are saved with the transcript. They flow through to DOCX, SRT, VTT, TXT and JSON exports. If you use the REST API, the raw output stays at speaker_0, speaker_1 etc. — rename via the API or post-process in your own script.
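If you take the post-processing route, the script can be very small. The sketch below assumes a JSON shape with a `segments` list whose items carry `speaker` and `text` fields; that shape and the `rename_speakers` helper are hypothetical, so check the field names against your actual export.

```python
import json
from typing import Dict


def rename_speakers(transcript: dict, names: Dict[str, str]) -> dict:
    """Return a copy of the transcript with speaker IDs mapped to names.

    IDs missing from `names` are left unchanged.
    """
    renamed = []
    for segment in transcript["segments"]:
        segment = dict(segment)  # copy so the input is not modified
        segment["speaker"] = names.get(segment["speaker"], segment["speaker"])
        renamed.append(segment)
    return {**transcript, "segments": renamed}


# Hypothetical export, for illustration only.
raw = json.loads("""
{
  "segments": [
    {"speaker": "speaker_0", "text": "Welcome to the show."},
    {"speaker": "speaker_1", "text": "Thanks for having me."},
    {"speaker": "speaker_2", "text": "Hello."}
  ]
}
""")

named = rename_speakers(raw, {"speaker_0": "Host", "speaker_1": "Lena Kovács"})
```

Keeping the mapping in your own code means the stable speaker_N identifiers stay in the raw data and the display names can change later without reprocessing.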

For interview-heavy workflows where you upload several files per week, this rename-once flow is the part that saves the most time. The interview transcription use case walks through it end-to-end.

FAQ

Does diarization work for files with more than two speakers?

Yes, on the pyannote path. Pyannote-3.1 detects an arbitrary number of speakers — we've seen clean results with up to six on a roundtable recording, and acceptable results with eight or nine. Above that, expect more merges and splits. The channel-split path is two speakers maximum by design, because it's literally splitting the left and right channels of a stereo file.

What happens if my stereo file has both speakers in both channels?

The channel split runs anyway, and you'll get two speakers, but speaker_0 and speaker_1 will each contain the full conversation at different volume levels — which is worse than treating the file as mono. If your "stereo" Zoom export or system-audio capture has both voices in both channels, downmix to mono before uploading so the pipeline routes to pyannote instead.

How accurate is the mono diarization on real podcast audio?

Published diarization error rates for state-of-the-art models run in the 8–14% range on standard benchmark sets, depending on the dataset. Real-world podcast audio with two distinct voices and minimal crosstalk usually lands inside that range. Conversations with similar-sounding voices, heavy crosstalk, or background music drift higher. The UI rename-and-reassign flow exists because no general-purpose diarization model is perfect on every file.
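For reference, diarization error rate is the share of reference speech time that is missed, falsely detected as speech, or attributed to the wrong speaker. The sketch below works through the arithmetic; the durations are invented for illustration and are not measurements from our pipeline.

```python
def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_speech: float,
) -> float:
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Illustrative numbers: 20 minutes of speech with 2 minutes of combined error.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=70.0, total_speech=1200.0
)
print(f"DER: {der:.1%}")
```

In this made-up case most of the error is speaker confusion, which is the kind you fix by reassigning a bubble in the UI.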

Can I tell the system how many speakers are in my file?

Not currently. Pyannote auto-detects speaker count on mono files, and the channel-split path on stereo files is deterministic (left = speaker_0, right = speaker_1). There's no API parameter to hint speaker count. We found in practice that the auto-detection handles the common case (2–6 distinct voices) well, and any wrong split is faster to fix with one click in the rename UI than with a tuning parameter. If you have a workflow that genuinely needs a hint (e.g. always exactly 4 speakers), let us know.

Does diarization run on every file regardless of plan?

Speaker diarization is on by default for Pro and Business plans. On the free tier it runs on stereo files (the channel split is essentially free to compute) but model-based mono diarization is gated to paid plans. See the pricing page for the current breakdown.

Why are my speakers labelled "Speaker 0" instead of names?

The diarization model has no way to know who's speaking — it only knows there are distinct voices. Naming is a human step. Click any speaker chip in the transcript, type the name once, and it propagates to every turn for that speaker across the file and into every export format. The API output stays as speaker_0, speaker_1, etc. so your downstream code has a stable identifier.

Does the model train on my audio?

No. We don't train models on user data. Pyannote-3.1 is a pretrained open-source model we run for inference only. Source audio is permanently deleted from our infrastructure within 24 hours of the job completing. Transcripts stay in your account until you delete them.
