Decision · Pick the recording method, get the right path. The same transcript pipeline serves journalists, UX researchers, and oral historians — but the prep is different for each.
Subject · Interview Transcription Decision Guide
Recording methods · 8 common, 1 pipeline
Output · Speaker-labelled · Timestamped · DOCX
Privacy · Audio deleted in 24h, no training

Interview transcription. Different recording, same result.

Journalists, UX researchers, oral historians, and lawyers all transcribe interviews — but each starts from a different recording. This page maps the eight common recording methods to their accuracy expectations, the prep step each needs, and the citation discipline that follows.

Speaker labels with manual rename · Timestamp per turn for citation · Source audio deleted in 24 hours · We don't train models on your data
Recording methods
8
Voice memo, USB mic, handheld field recorder, Zoom export, Squadcast, WhatsApp note, phone call, conference room. All accepted; each has a different accuracy ceiling.
Diarization sweet spot
2
Two-speaker interviews get the cleanest separation. Three-speaker panels work. Four-plus needs a manual review pass — plan accordingly.
Field-tested fields
5
Investigative journalism, UX user research, oral history archiving, legal depositions, qualitative academic research. Each field is covered below.
Definition · Reference passage

Interview transcription converts a recorded conversation between two or more speakers into speaker-labelled, timestamped, citable text. The transcript pipeline is one — but the prep is different for each recording method: a phone voice memo needs nothing, a Zoom recording needs to be downloaded as MP4 first, a handheld field recorder dump needs the SD card extracted. The decision tree below maps the eight common methods to the right path. Transcription.Solutions returns speakers separated and labelled, timestamps per turn for citation, optional AI summary with key-quote extraction, and exports tailored to research workflows — DOCX for editorial review, plain TXT for downstream tooling, JSON via API. The source audio is permanently deleted from our infrastructure within 24 hours of completion; we do not train models on your data.

Decision · Map your recording to the right path

How are you recording the interview?

Different recording methods have different accuracy ceilings and different prep work. Find your method below for what to expect and what to do before upload.

USB / shotgun mic into recorder
Best case. Drag the WAV/MP3 in directly. Expect 95%+ accuracy on tier-1 languages, clean speaker separation. The default recommendation for any interview you'll publish or cite. Tip: if you're recording two people, use two mics into separate channels — diarization gets perfect separation and you can mute one side while editing.
Handheld field recorder (Zoom H1n, Tascam DR-05)
Excellent. Pull the SD card, copy the WAV file, drop it in. Same 95%+ ceiling as a USB mic in a quiet room. Watch for wind noise on outdoor recordings — fit the foam windscreen.
iPhone Voice Memos / Android Recorder
Strong default. The phone mic in a quiet room is genuinely good. Voice Memos saves M4A; drag it in. Expect 92–95% accuracy. For a critical interview, hold the phone closer than you think. Tip: for two-speaker interviews, place the phone equidistant from both speakers, not closer to you.
Zoom / Google Meet / Microsoft Teams
One prep step. We do not transcribe Zoom URLs directly (they require login). Download the MP4 recording from Zoom, then drag the file in. Diarization works on the local + remote audio mix; expect ~94% on the host side, ~88% on the guest side. Tip: if your account has separate-track recording enabled, transcribe each track separately for cleaner diarization.
Squadcast / Riverside / Cleanfeed
Equivalent to Zoom but better. These services record locally on each speaker's device — drag both files in or use the platform's combined export. The local-recording quality means accuracy lands closer to 95% throughout.
WhatsApp / Telegram voice notes
Direct upload. OPUS / OGG files accepted. Voice notes from WhatsApp arrive at 16 kHz mono — fine for transcription, ~93% accuracy. Useful for journalist source-callbacks where the source dictates from a phone.
Phone call (recorded via app)
Lower ceiling — plan an editorial pass. Phone audio is sampled at 8 kHz, half the bandwidth of a typical voice memo, which pushes accuracy to 80–85%. Diarization still works on the two-channel call, but proper-noun errors are more common. A single read-through with the audio playing catches them quickly.
Conference room ceiling mic / one laptop mic for 4 people
Hardest case. Multi-speaker audio mixed into one channel, room reverb, distance from speakers. Expect 80–85% accuracy and frequent diarization merges on speakers seated next to each other. Plan a manual speaker-correction pass before quoting. Tip: if possible, ask each participant to record on their phone too as a backup — even one well-placed phone mic dramatically improves results.
Output · 6 deliverable elements

What interviewers actually use

01

Speaker labels with rename

Two-person interview: clean separation. Three-person interview (e.g. UX researcher + two participants): usually clean. Manual rename once and every turn updates. Crucial for citation discipline.

02

Timestamps every turn

Each speaker turn carries its own timestamp. "Source said X at 14:22" links back to the exact moment for verification. Click the timestamp to scrub the audio there. Standard for fact-checked journalism.

03

AI summary pulls quotes

On Pro and Business, the AI summary extracts the most quotable lines with their timestamps. A 90-minute interview produces a list of 8–15 citable quotes you'd otherwise re-listen for. Saves a second listening pass.

04

Search across all interviews

If you do a multi-interview project (a UX research round, a long-form investigation), search hits the words inside every transcript. "Did anyone mention X?" goes from re-listening to 5 seconds.

05

DOCX for editorial review

Clean Microsoft Word export with speakers, timestamps, and paragraph breaks. The format every editor and IRB committee expects. Or plain TXT if you prefer to massage in your own pipeline.

06

Privacy-first by default

Source audio is deleted from our infrastructure within 24 hours after transcription completes. Transcripts stay in your account until you delete them. We don't train models on your data; the upstream ASR provider operates under no-training paid endpoints.

How you recorded · Each method has different audio properties

Common interview recording methods

Diarization quality and word accuracy depend on the recording. The methods below are listed roughly in descending audio quality. None of them are "too low quality to bother with" — they each have a sensible accuracy expectation.

Quality · What to expect, honestly

Accuracy on real interview audio

Interviews are some of the most demanding audio for ASR — multiple speakers, varied recording quality, accents, sometimes deliberate quietness for sensitive topics. Honest expectations below.

94%+
On a two-person interview recorded on a decent mic in a quiet room. This is the typical UX research session, journalism interview, or oral-history recording with a handheld recorder.
What we deliver
94%+

Two-person interview, controlled.

USB mic, handheld recorder, or Voice Memos in a quiet room. Two clearly recorded speakers. Diarization is excellent; word accuracy is reliably above the threshold for fact-check quoting.

  • UX user-research sessions in a quiet office
  • Journalism interviews in a coffee-shop back room
  • Founder interviews recorded with a USB mic
  • Oral-history recordings with a handheld
What's normal
85%+

Field interviews, panels, calls.

Phone recordings, conference-room ceiling mics, three-person panels, slight background noise. Speaker labels still work; word accuracy lands here. Plan a single editorial pass before quoting.

  • Phone-interviewed sources
  • Group focus sessions with 3–4 participants
  • Conference panel discussions
  • Outdoor or street interviews on a phone
What we don't do, and what hurts accuracy

Off-the-record signalling

We can't tell when an interviewee says "this part is off the record" — the transcript captures everything. Plan how you handle off-the-record material before recording. Many journalists keep a separate redacted DOCX.

Heavy crosstalk

Three or more interviewees talking over each other — common in family-style oral histories, group focus sessions — pushes accuracy below 85%. Diarization may merge two voices it can't separate. Mark these as priority for a manual review pass.

Sensitive-content recordings

We process the audio on our servers and delete it within 24 hours. We don't train models on your data. But for HIPAA-regulated medical interviews or legal-privileged depositions, talk to your compliance team before uploading. Our standard contract works for most journalism / research / UX use; specialised contexts need their own review.

Whisper-recorded sensitive interviews

Interviews recorded at very low volume (whispered, distance-recorded, or capturing a hesitant source) drop to 70–80% accuracy regardless of language. We can't enhance the signal that isn't there. Plan a careful manual pass.

Worked example · From the inbox of an investigative journalist

How a six-month investigation transcribed 88 source interviews

A long-form investigative reporter — six-month exclusive on a regulated industry. 88 source interviews recorded on a mix of USB mic (in-person, 41), iPhone Voice Memos (field, 33), and Zoom export (regulator interviews, 14). Citation discipline mattered: every quote in the final piece needed a verifiable timestamp.
01

Uploaded as recordings landed

Each interview transcribed within an hour of recording — same-day, while the conversation was still fresh. Renamed Speaker 2 to the source's pseudonym in the speaker library.

~12 min/interview
02

Coded transcripts in real time

Used inline tags ("#regulator", "#anonymous", "#offrecord") to mark themes and citation status. Off-record sections were tagged for redaction in the published piece but kept in the transcript for context.

~25 min/interview
03

Searched the corpus when fact-checking

When the editor flagged a claim during fact-check, opened the dashboard search, typed the keyword, jumped to the moment in the audio. The verification cycle shrank from "come back tomorrow" to "five seconds".

Replaced 3-day cycles
04

Built citation footnotes from timestamps

Every quote in the published piece carried an internal footnote: source pseudonym + recording number + timestamp. Editor's fact-check pass took two days instead of two weeks.

~10 h saved
Published outcome
The investigation ran in four installments over three weeks. Every quote was defensible down to a citation timestamp. Total transcription cost across the project: $147 — less than a single day of a freelance transcriber.
Total interviews: 88
Total cost: $147
Per interview: $1.67
Equivalent at $1/min: $5,280
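The arithmetic behind those figures is easy to sanity-check; a minimal sketch, assuming the $1/min benchmark covers the same total minutes of audio:

```python
# Sanity-check the project figures quoted above.
total_interviews = 88
total_cost = 147.00          # project transcription spend, USD
human_rate = 1.00            # freelance benchmark, USD per audio minute
human_equivalent = 5280.00   # same audio billed at the benchmark rate

per_interview = total_cost / total_interviews
total_minutes = human_equivalent / human_rate   # 5,280 minutes of audio
avg_length = total_minutes / total_interviews   # average interview length

print(f"${per_interview:.2f} per interview")        # $1.67 per interview
print(f"{avg_length:.0f} min average interview")    # 60 min average interview
```

At roughly 60 minutes per interview, $1.67 each works out to under 3¢ per audio minute against the $1/min human benchmark.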
Citation discipline · From a working investigative reporter
5 sec

Time to verify a quote during fact-check, down from "come back tomorrow". The single change that turned transcripts from a debt into an asset across a six-month investigation.

Q&A · Common questions

Twelve more questions, twelve more answers

If the decision tree didn't cover your recording, the FAQ below catches the long tail. Twelve concrete questions from working journalists, researchers, and lawyers — answered honestly.

  1. Are speaker labels accurate enough to cite from?
    On a two-person interview with reasonable audio quality, yes — we recommend renaming Speaker 1 / Speaker 2 to actual names once, then quoting directly. On three-person interviews, expect occasional merge errors; verify each quote's speaker before publishing. On four-plus speakers, plan a manual review pass.
  2. Can I get a timestamp per speaker turn?
    Yes. Each speaker turn carries its own timestamp in the transcript. Citation discipline: "Sarah, 14:22 in the recording." Click the timestamp in our viewer to scrub the audio to that moment.
  3. What audio formats work for interviews?
    Common interview recording formats all work directly: MP3, WAV, M4A (iPhone Voice Memos), OGG / OPUS (WhatsApp / Telegram voice notes), FLAC, MP4 / MOV / MKV for video interviews. Up to 4 hours per file on Business. No conversion needed.
  4. Does it work for non-English interviews?
    Yes — 99 languages. Auto-detection picks the language; force a specific one in advanced settings if a multilingual interview confuses detection. Tier-1 (English, Spanish, German, French, Portuguese, Italian, Dutch, Polish) gets studio-grade accuracy. Tier-2 (Russian, Japanese, Mandarin, Korean and 9 more) gets production-grade. Tier-3 (Arabic, Hebrew, Hindi, etc.) is usable but plan a careful review pass.
  5. Is it private enough for sensitive interviews?
    Source audio is deleted from our infrastructure within 24 hours of completion. Transcripts and summaries stay in your account until you delete them. We do not train models on your data. For HIPAA-regulated medical interviews, legal-privileged depositions, or other specialised compliance contexts, talk to your compliance team before uploading — our standard contract may not cover those cases out of the box.
  6. How accurate is it on Zoom-recorded interviews?
    About 94% on the host side (local mic), ~88% on the guest side (internet-quality audio). Diarization separates the two cleanly. Plan a single editorial review for anything published without a re-listen.
  7. Can I share a transcript with my editor or IRB?
    Yes — export DOCX (the format every editor expects), plain TXT, or the underlying JSON via API. The DOCX includes speaker labels, paragraph breaks, and (optionally) timestamps. Editors get clean text; IRB committees get a citable record.
  8. What about a long oral-history recording — three hours?
    Business plan accepts up to 4 hours per file. A 3-hour oral history typically completes in 25–30 minutes. The transcript is searchable across your account, so a 30-recording archive becomes a single text-searchable corpus.
  9. How do I handle off-the-record material in the transcript?
    We don't auto-detect "off the record" — the transcript captures everything spoken. Most journalists handle this manually: tag off-record sections in the transcript with an inline marker, then export a redacted DOCX for editorial review while keeping the full transcript private. The unredacted version stays in your account; only you control what leaves.
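One way to produce that redacted copy from a plain-text export is a short script. A sketch only, assuming you tag off-record passages yourself with an inline marker such as `#offrecord` (the tag convention is your own, not a feature of the service):

```python
def redact(transcript: str, tag: str = "#offrecord") -> str:
    """Replace any speaker turn containing the tag with a redaction notice.

    Assumes the plain-text export separates speaker turns with blank lines.
    """
    out = []
    for turn in transcript.split("\n\n"):
        if tag in turn:
            out.append("[REDACTED: off the record]")
        else:
            out.append(turn)
    return "\n\n".join(out)


text = (
    "Speaker 1 [00:12]: On the record: the filing was late.\n\n"
    "Speaker 2 [00:45]: #offrecord The real reason was internal.\n\n"
    "Speaker 1 [01:10]: Back on the record."
)
print(redact(text))  # the tagged middle turn becomes "[REDACTED: off the record]"
```

The full transcript stays untouched in your account; only the redacted copy goes to editorial review.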
  10. Can I anonymise speaker names before exporting?
    Yes. Use the speaker-rename feature to set pseudonyms (Source A, Source B, "the regulator"). The export carries the renamed labels, not the originals. Transcripts in your account can be deleted after the project ships if needed for source protection.
  11. What if my source uses a strong accent or speaks softly?
    Strong accents are usually fine on tier-1 languages — Glaswegian, Texan, Brazilian Portuguese, French Canadian, all transcribe at ~90%+. Whispered or very-low-volume audio is the genuine challenge: we can't enhance signal that isn't there. Hold the mic close, ask the source to speak at conversational volume, plan a manual review pass for sensitive sections.
  12. Can I integrate this with my CAQDAS tool (NVivo, Atlas.ti, MaxQDA)?
    Yes — export DOCX or plain TXT and import into your coding software. The transcript carries speaker labels and timestamps in a format every CAQDAS tool reads. For programmatic ingestion (long studies, batch processing), use the JSON export via API.
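For the programmatic route, a conversion sketch. The JSON field names below (`segments`, `speaker`, `start`, `text`) are illustrative assumptions, so check them against the actual API response before relying on this:

```python
import json

def json_to_txt(raw: str) -> str:
    """Flatten an assumed transcript-JSON structure into CAQDAS-friendly text.

    Field names are hypothetical; adjust to the real export schema.
    """
    lines = []
    for seg in json.loads(raw)["segments"]:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f'{seg["speaker"]} [{minutes:02d}:{seconds:02d}]: {seg["text"]}')
    return "\n".join(lines)


sample = '{"segments": [{"speaker": "Source A", "start": 862, "text": "It was late."}]}'
print(json_to_txt(sample))  # Source A [14:22]: It was late.
```

The resulting speaker-labelled, timestamped plain text imports into any CAQDAS tool that reads TXT.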
Action · Start trial

Try it on one interview.

60 free minutes per month, no card. Drop a recording — first transcript with speaker labels in about 10 minutes.

Start free