From phone recorder to quote-ready text: the journalist's transcription workflow

Journalist interview transcription works in three steps: upload the recording (phone memo, Zoom file, or URL), let diarization split it into Speaker 1 and Speaker 2, then rename the speakers once and export to DOCX with timestamps for fact-checking. Expect around 90%+ accuracy on a clean in-person recording at 128 kbps+, somewhat lower on Zoom audio, and noticeably lower on 8 kHz phone calls — meaning a phone interview always needs a manual review pass against the audio before quoting.

What accuracy can you expect from interview audio?

Interview audio splits into three buckets, and each plateaus at a different accuracy. A studio-style in-person recording at 128 kbps or higher lands around 90%+ word accuracy — close to the realistic ceiling for conversational speech. Zoom and Google Meet recordings sit somewhat lower, hurt by compression and the occasional dropped frame. Telephony — anything routed through a cellular or landline call at 8 kHz — drops noticeably further. Treat any phone interview transcript as a draft requiring a review pass before you quote from it.

In practice, that lower accuracy means errors concentrate on proper nouns, technical terms, and anything an accented speaker says quickly. The transcript is still a useful scaffold, since finding quotes by ear in a 45-minute file is much slower than scanning text, but every quote you intend to publish needs to be checked against the audio. The audio-to-text engine auto-detects the language from the first 30 seconds, so if your source slips between English and another language mid-answer, set the language manually rather than relying on detection.

Switching tools will not fix this: phone audio is bandwidth-limited at the carrier, not at the transcription stage. No model can recover information the codec has already discarded. If quotes are mission-critical, record on a separate device next to the speakerphone. Even a phone's built-in voice memo at full bandwidth beats the telephony stream.
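A quick check of the numbers shows why the limit is physical. The sketch below assumes the "8 kHz" figure is the sample rate of the recording and that you have a WAV copy of the file; the function names are illustrative and not part of the product. A sampled signal can only carry frequencies up to half its sample rate (the Nyquist limit), so an 8 kHz call tops out at 4 kHz.

```python
import wave


def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a recording at this sample rate can represent."""
    return sample_rate_hz / 2


def describe_wav(path: str) -> str:
    """Summarise the sample rate and frequency ceiling of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    # Assumption: 8 kHz or below is treated as telephony-grade audio.
    grade = "telephony-grade" if rate <= 8000 else "wider than telephony"
    return f"{rate} Hz sample rate, content up to {nyquist_limit_hz(rate):.0f} Hz ({grade})"
```

By the same arithmetic, a voice memo sampled at 44.1 kHz can carry content up to 22,050 Hz, which is why the separate-device recording gives the model more to work with.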

Recording setups, ranked by what you get back

Here's how the common interview setups behave once they hit the upload box:

Setup	Bandwidth / format	Expected accuracy	Diarization quality
In-person, two phone memos	256 kbps AAC each	~94%	Excellent (channel split if merged stereo)
In-person, one phone on table	128 kbps AAC mono	~92%	Good (speaker diarization)
Zoom cloud recording	Variable, ~64-96 kbps	~88%	Good — Zoom already separates participants
Speakerphone with phone recording	8 kHz mono	~82%	Mediocre — crosstalk merges turns
Cellular call recorded on second device	8 kHz mono	~80-82%	Same as above

The best outcome from the worst format — a speakerphone call recorded on a phone — comes from one habit: have the source speak first, in isolation, for about 10 seconds (a name and role intro works), then begin the questions. That gives the diarization model a clean reference for Speaker 1 before any crosstalk starts.

How diarization works on a two-speaker interview

Speaker diarization is the process of identifying who spoke when in a recording, then labelling each turn with a speaker ID before transcription. For interviews, this is what turns a wall of text into a readable Q&A.

The mechanism splits in two:

Stereo file with each speaker on a separate channel: for example, a Zoom recording exported with separate audio tracks, or two phone memos merged left-right in a DAW. The system uses a channel-split method: left becomes speaker_0, right becomes speaker_1. Confidence is 100%, and no model inference is needed.
Mono file with both voices on one channel: the default for phone calls, speakerphone, and single-device recordings. The system runs a speaker diarization model to detect speaker turns from voice characteristics. Accuracy is strong on two-speaker interviews with limited crosstalk and degrades when speakers interrupt each other often.
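The channel-split case is simple enough to sketch. The example below is an illustration of the idea, not the service's actual implementation: it assumes a 16-bit PCM stereo WAV, and `split_stereo` is a hypothetical name. Stereo samples are stored interleaved (left, right, left, right), so taking every other sample recovers each speaker's track.

```python
import array
import sys
import wave


def split_stereo(path: str) -> dict[str, array.array]:
    """Split a 16-bit stereo WAV into one sample array per channel."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo WAV file")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":  # WAV samples are little-endian
        samples.byteswap()
    # Even-indexed samples are the left channel, odd-indexed the right.
    return {"speaker_0": samples[0::2], "speaker_1": samples[1::2]}
```

Because the assignment depends only on which channel a sample sits in, there is nothing to infer, which is why crosstalk does not confuse it the way it confuses mono diarization.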

In the transcript view, each turn appears as a bubble with a coloured chip — speaker_0, speaker_1. Click the chip on any bubble and a popover opens with four actions: rename, filter to this speaker, copy this speaker's lines, or jump to their first turn. Rename speaker_0 to "Dr. Mehta" once, and every bubble updates. This is the bit that saves the most time on long interviews — you do not rename 80 turns, you rename one chip.

For multi-source pieces — quoting a panel discussion, a press briefing, a roundtable — the same flow works with three to five speakers, though accuracy on speaker assignment drops as the number rises and as turn lengths shorten.

The proper-noun problem (and how to handle it)

The errors that bite journalists are almost never the common words. They are names of people, places, organisations, drugs, products, court cases, and acronyms — exactly the words that carry meaning in your story. The model has not seen your source's name before. It will guess phonetically.

A working approach:

Before the interview, write a 5-10 line glossary of the names and terms you expect to come up — your source, their org, the place names, the technical jargon.
After the transcript lands, use the transcript search to find each glossary item. Variations like "Mehta" / "Meta" / "Mehtaa" cluster together — fix them with find-and-replace.
For the proper nouns you missed in the glossary, the timestamps on each line tell you exactly where in the audio to listen back. A 45-minute interview becomes a series of 5-second checks.
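If you work from the TXT export, the glossary pass above can be partly scripted. This is a minimal sketch under stated assumptions: glossary terms are single words, `find_variants` is a hypothetical helper, and the 0.8 similarity cutoff is a starting guess to tune. It only lists candidate misspellings for you to review; it does not replace anything automatically, since a near-match may be a different, legitimate word.

```python
import difflib
import re


def find_variants(transcript: str, glossary: list[str], cutoff: float = 0.8) -> dict[str, list[str]]:
    """List words in the transcript that look like misspellings of glossary terms."""
    words = set(re.findall(r"[A-Za-z][A-Za-z'-]*", transcript))
    variants: dict[str, list[str]] = {}
    for term in glossary:
        close = [
            word
            for word in words
            if word.lower() != term.lower()  # exact matches need no fixing
            and difflib.SequenceMatcher(None, term.lower(), word.lower()).ratio() >= cutoff
        ]
        variants[term] = sorted(close)
    return variants


sample = "Dr. Meta said the trial, Mehtaa added, was run by Mehta at the clinic."
print(find_variants(sample, ["Mehta"]))  # {'Mehta': ['Mehtaa', 'Meta']}
```

Each candidate still deserves a listen at its timestamp before you replace it.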

The DOCX export preserves the speaker labels and timestamps inline, which is the format most fact-checkers expect to receive alongside the audio file.

Exporting for fact-checking and quoting

Every plan exports the same five formats: DOCX, TXT, SRT, VTT, JSON. For interview work, two of these matter:

DOCX with speaker labels and timestamps — the format you hand to an editor or fact-checker. They can scan the document, click a timestamp note next to a quote, and skip to that point in the audio file. This is the workflow most newsroom standards expect.
TXT clean — when you have already verified quotes and want plain text to paste into your draft, free of timestamps.

Source audio is permanently deleted from our infrastructure within 24 hours of the job completing. Transcripts stay in your account until you delete them, and we do not train models on user data, which matters when interviews are under embargo or cover sensitive topics. If your workflow requires the transcript to leave our storage sooner, download the DOCX and delete the transcript yourself.

What this means for your case

For a daily reporter doing one or two interviews a week, the free tier's 30 minutes per month covers a single short call but not much else — Pro at $19/month for 600 minutes is the realistic baseline. For a long-form journalist doing 4-6 hours of interviews per piece, Pro covers about one feature per month; top up with a one-time minute pack ($5 for 300 minutes, $15 for 1,100, or $39 for 3,500) when you go over.
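To budget a heavier month, the arithmetic can be sketched as follows. The prices are the ones quoted above; the assumptions are that minute packs can be stacked and used alongside Pro in the same month, and the function names are illustrative. The search picks the cheapest combination of packs that covers the overage.

```python
from functools import lru_cache

PRO_PRICE_USD = 19
PRO_MINUTES = 600
PACKS = ((5, 300), (15, 1100), (39, 3500))  # (price in USD, minutes)


@lru_cache(maxsize=None)
def cheapest_topup_usd(overage_minutes: int) -> int:
    """Lowest total pack price covering the overage (assumes packs stack)."""
    if overage_minutes <= 0:
        return 0
    return min(price + cheapest_topup_usd(overage_minutes - minutes) for price, minutes in PACKS)


def monthly_cost_usd(interview_minutes: int) -> int:
    """Pro subscription plus the cheapest top-up for minutes beyond 600."""
    return PRO_PRICE_USD + cheapest_topup_usd(interview_minutes - PRO_MINUTES)


print(monthly_cost_usd(360))  # one 6-hour feature fits within Pro: 19
print(monthly_cost_usd(720))  # two features: 19 + one $5 pack = 24
```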

If you record everything in person at full bandwidth, you can quote directly from the transcript with a quick listen-back on each quote. If your beat is phone-heavy — political reporting, beat coverage of agencies that prefer phone over Zoom — budget time for a full review pass on every interview before any quote leaves the document. The 82% accuracy is not a flaw to engineer around; it is the ceiling of the audio you have.

The REST API is available on every plan if you want to push interview files from a custom recording app or pipe transcripts into a CMS, though most working journalists never need it — the dashboard upload is faster for a handful of files a week.

FAQ

How do I transcribe a Zoom interview that I recorded to the cloud?

Download the cloud recording from Zoom, either the MP4 video file or the M4A audio-only file, and upload it directly. If you enabled "Record separate audio files for each participant", merge the tracks into one stereo file first (left = you, right = source) for the cleanest diarization. Zoom audio typically transcribes at ~88% accuracy, and a 60-minute file completes in 9-11 minutes.
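If you would rather not open a DAW for the merge, it can be done with a short script. This is a sketch under stated assumptions: both tracks have already been converted to 16-bit mono WAV at the same sample rate (Zoom's per-participant files are M4A, so they need converting first), and the function names are hypothetical. The shorter track is padded with silence so the two channels line up from the start.

```python
import array
import sys
import wave


def read_mono_wav(path: str) -> tuple[int, array.array]:
    """Read a 16-bit mono WAV and return (sample_rate, samples)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError(f"{path}: expected a 16-bit mono WAV file")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":  # WAV samples are little-endian
        samples.byteswap()
    return rate, samples


def merge_to_stereo(left_path: str, right_path: str, out_path: str) -> None:
    """Write a stereo WAV: left_path on the left channel, right_path on the right."""
    left_rate, left = read_mono_wav(left_path)
    right_rate, right = read_mono_wav(right_path)
    if left_rate != right_rate:
        raise ValueError("both tracks must share one sample rate")
    length = max(len(left), len(right))
    # Pad the shorter track with silence so both channels are equal length.
    left.extend([0] * (length - len(left)))
    right.extend([0] * (length - len(right)))
    stereo = array.array("h", bytes(2 * length * left.itemsize))
    stereo[0::2] = left   # even samples: left channel (you)
    stereo[1::2] = right  # odd samples: right channel (source)
    if sys.byteorder == "big":
        stereo.byteswap()
    with wave.open(out_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(left_rate)
        out.writeframes(stereo.tobytes())
```

Calling `merge_to_stereo("me.wav", "source.wav", "interview.wav")` produces a file whose channels map directly onto the two speakers.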

Can I transcribe a phone interview without recording the call directly?

Yes — record the call on a second device using its voice memo app while the call plays on speakerphone, or use an in-line call recorder app on the phone making the call. Either way, you will get 8 kHz telephony-quality audio with ~82% transcription accuracy. A separate-device recording is usually marginally better because it captures full-bandwidth audio of at least your side of the conversation.

How do I rename "Speaker 1" and "Speaker 2" to actual names?

In the transcript view, click the coloured speaker chip on any bubble. A popover opens with four actions; pick "Rename". Type the real name once, and every turn by that speaker updates across the entire transcript. You do not rename each turn individually. The renamed labels are preserved in the DOCX, TXT, and SRT exports.

Do I need a separate plan to transcribe interviews in another language?

No. The system supports 99 languages on every plan at one price — there is no tier-1/tier-2 structure. The language is auto-detected from the first 30 seconds of audio, or you can set it manually for noisy files or interviews that mix languages. Accuracy varies by language (English, Spanish, French, German sit at the top; smaller languages can run 15-40% word error rate).

Is the audio kept after transcription? My source needs assurances.

Source audio is permanently deleted from infrastructure within 24 hours of the job finishing. The transcript stays in your account until you delete it manually. We do not train models on user data. For sensitive interviews — embargoed material, confidential sources — download the DOCX immediately and delete the transcript from the account to remove it from our storage as well.

How fast does a 45-minute interview transcribe?

About 7-8 minutes. The system runs at roughly 6× realtime using parallel chunking, so a 60-minute file completes in 9-11 minutes. You can leave the tab, get a coffee, and the transcript is ready when you come back. Longer interviews scale linearly: a 2-hour panel discussion takes around 20-22 minutes.
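For planning around a deadline, the estimate is a single division. The sketch below assumes the roughly 6× realtime figure holds for your file; actual times vary by a minute or two either way, and the function name is illustrative.

```python
def estimated_processing_minutes(audio_minutes: float, speedup: float = 6.0) -> float:
    """Rough turnaround for a file, assuming processing at `speedup` times realtime."""
    return audio_minutes / speedup


for length in (45, 60, 120):
    print(f"{length} min of audio: about {estimated_processing_minutes(length):.1f} min to transcribe")
```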

What if my source has a strong accent?

Accented speech in well-recorded audio still hits the 88-92% range for most major accents the model has seen — South Asian English, West African English, Scottish, Irish, regional American. Where accuracy drops is on the combination of accent + low-bandwidth audio + technical vocabulary. Build a glossary of likely terms before the interview, then use transcript search and replace to fix the recurring mistranscriptions after upload.
