Journalists, UX researchers, oral historians, and lawyers all transcribe interviews — but each starts from a different recording. This page maps the eight common recording methods to their accuracy expectations, the prep step each needs, and the citation discipline that follows.
Interview transcription converts a recorded conversation between two or more speakers into speaker-labelled, timestamped, citable text. The transcription pipeline is the same for every method, but the prep differs: a phone voice memo needs nothing, a Zoom recording needs to be downloaded as MP4 first, a handheld field recorder needs the SD card extracted. The decision tree below maps the eight common methods to the right path. Transcription.Solutions returns speakers separated and labelled, timestamps per turn for citation, an optional AI summary with key-quote extraction, and exports tailored to research workflows: DOCX for editorial review, plain TXT for downstream tooling, JSON via API. Source audio is permanently deleted from our infrastructure within 24 hours of completion; we do not train models on your data.
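The JSON export lends itself to downstream tooling. A minimal sketch of pulling a source's quotable turns out of a transcript, assuming a hypothetical schema with `speaker`, `start`, and `text` fields per turn (the real export's field names may differ):

```python
import json

# Hypothetical transcript shape: a list of speaker turns,
# each carrying a speaker label, a start timestamp, and the text.
transcript = json.loads("""
{
  "turns": [
    {"speaker": "Interviewer", "start": "00:00:04", "text": "Thanks for joining."},
    {"speaker": "Source", "start": "00:00:09", "text": "Happy to be here."},
    {"speaker": "Source", "start": "00:14:22", "text": "The report was never filed."}
  ]
}
""")

# Collect every turn by the source, keyed by timestamp for citation.
quotes = [
    (turn["start"], turn["text"])
    for turn in transcript["turns"]
    if turn["speaker"] == "Source"
]

for start, text in quotes:
    print(f'[{start}] "{text}"')
```

Because every turn keeps its own timestamp, each extracted quote already carries the citation anchor a fact-checker needs.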
Different recording methods have different accuracy ceilings and different prep work. Find your method below for what to expect and what to do before upload.
Two-person interview: clean separation. Three-person interview (e.g. a UX researcher plus two participants): usually clean. Rename a speaker once and every turn updates, which is crucial for citation discipline.
Each speaker turn carries its own timestamp. "Source said X at 14:22" links back to the exact moment for verification. Click the timestamp to scrub the audio there. Standard for fact-checked journalism.
On Pro and Business, the AI summary extracts the most-quotable lines with their timestamps. A 90-minute interview produces a list of 8–15 citable quotes you would otherwise have to re-listen to find. It saves the second listening pass.
On a multi-interview project (a UX research round, a long-form investigation), search covers the words inside every transcript. "Did anyone mention X?" goes from re-listening to a five-second search.
Clean Microsoft Word export with speakers, timestamps, and paragraph breaks. The format every editor and IRB committee expects. Or plain TXT if you prefer to process it in your own pipeline.
Source audio is deleted from our infrastructure within 24 hours after transcription completes. Transcripts stay in your account until you delete them. We don't train models on your data; the upstream ASR provider operates under no-training paid endpoints.
Diarization quality and word accuracy depend on the recording. The methods below are listed roughly in descending audio quality. None of them are "too low quality to bother with" — they each have a sensible accuracy expectation.
Interviews are some of the most demanding audio for ASR — multiple speakers, varied recording quality, accents, sometimes deliberate quietness for sensitive topics. Honest expectations below.
USB mic, handheld recorder, or Voice Memos in a quiet room. Two clearly-recorded speakers. Diarization is excellent; word accuracy is reliably above the threshold for fact-check quoting.
Phone recordings, conference-room ceiling mics, three-person panels, slight background noise. Speaker labels still work; word accuracy lands a tier below the clean-recording ceiling. Plan a single editorial pass before quoting.
We can't tell when an interviewee says "this part is off the record" — the transcript captures everything. Plan how you handle off-the-record material before recording. Many journalists keep a separate redacted DOCX.
Three or more interviewees talking over each other — common in family-style oral histories, group focus sessions — pushes accuracy below 85%. Diarization may merge two voices it can't separate. Mark these as priority for a manual review pass.
We process the audio on our servers and delete it within 24 hours. We don't train models on your data. But for HIPAA-regulated medical interviews or legal-privileged depositions, talk to your compliance team before uploading. Our standard contract works for most journalism / research / UX use; specialised contexts need their own review.
Interviews recorded at very low volume (whispered, distance-recorded, or capturing a hesitant source) drop to 70–80% accuracy regardless of language. We can't enhance the signal that isn't there. Plan a careful manual pass.
Each interview transcribed within an hour of recording — same-day, while the conversation was still fresh. Renamed Speaker 2 to the source's pseudonym in the speaker library.
Used inline tags ("#regulator", "#anonymous", "#offrecord") to mark themes and citation status. Off-record sections were tagged for redaction in the published piece but kept in the transcript for context.
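A tag convention like this is easy to machine-check later. A minimal sketch, assuming the tags are typed inline as plain `#word` tokens in the transcript text (the tag names and line format are this example's, not a product feature):

```python
import re

# Example transcript lines with inline tags typed during review.
lines = [
    '[00:03:10] Source: The filing deadline was missed. #regulator',
    '[00:11:45] Source: Keep this next part private. #offrecord',
    '[00:12:30] Source: Back on the record now.',
]

TAG = re.compile(r"#(\w+)")

# Every line tagged #offrecord, so it can be redacted before publication.
off_record = [line for line in lines if "offrecord" in TAG.findall(line)]

# All tags used across the project, for a quick theme inventory.
tags_used = sorted({t for line in lines for t in TAG.findall(line)})
```

The redaction list feeds the published-piece DOCX; the full tagged transcript stays intact for context, matching the workflow above.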
When the editor flagged a claim during fact-check, opened the dashboard search, typed the keyword, jumped to the moment in the audio. The verification cycle shrank from "come back tomorrow" to "five seconds".
Every quote in the published piece carried an internal footnote: source pseudonym + recording number + timestamp. Editor's fact-check pass took two days instead of two weeks.
Time to verify a quote during fact-check, down from "come back tomorrow". The single change that turned transcripts from a debt into an asset across a six-month investigation.
If the decision tree didn't cover your recording, the FAQ below catches the long tail. Twelve concrete questions from working journalists, researchers, and lawyers — answered honestly.
60 free minutes per month, no card. Drop a recording — first transcript with speaker labels in about 10 minutes.
Start free