Drop an MP3 file at any bitrate from 64 to 320 kbps. Get a timestamped, speaker-labeled transcript in 99 languages — no format conversion, no re-encoding, no waiting on a queue.
MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously
YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more
↓ Watch what comes out
We read the MP3 frame headers directly — VBR, CBR, joint-stereo, any encoder (LAME, Fraunhofer, FFmpeg). If the file is true stereo with speakers on separate channels, we use that to split voices. Mono mix-down falls back to acoustic diarization.
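For the curious, here is roughly what reading a frame header involves. This is a minimal sketch, not our production parser: it assumes MPEG-1 Layer III only, and the names `FrameHeader`, `parse_frame_header` and `diarization_strategy` are hypothetical. The bit layout and lookup tables follow the standard MPEG audio frame header.

```python
from dataclasses import dataclass

# Lookup tables for MPEG-1 Layer III only (the common ".mp3" case).
# Index 0 is "free format" and index 15 is invalid, so both map to None.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


@dataclass
class FrameHeader:
    bitrate_kbps: int
    sample_rate_hz: int
    channel_mode: str


def parse_frame_header(header: bytes) -> FrameHeader:
    """Decode the 4-byte header that starts every MP3 frame."""
    if len(header) != 4:
        raise ValueError("an MPEG audio frame header is exactly 4 bytes")
    word = int.from_bytes(header, "big")
    if (word >> 21) & 0x7FF != 0x7FF:
        raise ValueError("missing 11-bit frame sync")
    version = (word >> 19) & 0x3  # 0b11 means MPEG-1
    layer = (word >> 17) & 0x3    # 0b01 means Layer III
    if version != 0b11 or layer != 0b01:
        raise ValueError("this sketch handles MPEG-1 Layer III only")
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES_HZ[(word >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        raise ValueError("free-format or reserved header value")
    return FrameHeader(bitrate, sample_rate, CHANNEL_MODES[(word >> 6) & 0x3])


def diarization_strategy(header: FrameHeader) -> str:
    """Pick a first-pass strategy from the channel mode alone (assumed policy)."""
    # A mono file has nothing to split, so acoustic diarization is the fallback.
    if header.channel_mode == "mono":
        return "acoustic"
    # Two channels only make a split possible. Whether the speakers really sit
    # on separate channels still has to be checked against the decoded audio.
    return "channel-split candidate"


# 0xFFFB9064 is a typical header: 128 kbps, 44.1 kHz, joint stereo.
print(parse_frame_header(bytes.fromhex("fffb9064")))
```

In a VBR file the bitrate field changes from frame to frame, so a real reader would walk every frame rather than trust the first header.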
So when did you first realise the archive was incomplete?
Probably around 2019, when we started digitising the reel-to-reels.
And the missing tapes — were they catalogued anywhere at all?
There's a paper index from '78, but half of it's water-damaged.
↓ This is the dashboard
Same layout as the real dashboard — Summary, full Transcript, Speakers tab, Exports. Key points and action items extracted automatically. Auto-tags on every job.
Sample preview from a founder interview about post-call workflow. Real transcripts look exactly like this — same tabs, same summary block, same key-points / action-items split, same auto-tag chips.
Three real options · honest comparison
You can run Whisper on your own laptop for free if you're technical. Otter and Sonix accept MP3 uploads inside subscription dashboards. We take the file, return the transcript, and don't make you live inside a UI.
Free if you have a GPU and an afternoon. No speaker diarization out of the box.
Drop the MP3. Get speaker-labeled text back in roughly 0.025× real time, or about 90 seconds per hour of audio.
Polished dashboard, monthly minutes cap, English-tuned. File upload feels like a side feature.
Pricing and feature availability accurate as of May 2026. Whisper performance varies by model size and hardware.
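If you want to try the Whisper route yourself, a minimal sketch looks something like this. It assumes the open-source `openai-whisper` package and an `ffmpeg` binary on your PATH. The helper names (`format_timestamp`, `format_segments`, `transcribe_locally`) and the file name in the usage note are made up for illustration. The output has timestamps but no speaker labels.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_segments(segments) -> str:
    """Turn Whisper-style segments (dicts with 'start' and 'text') into lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{format_timestamp(seg['start'])}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_locally(path: str, model_name: str = "base") -> str:
    """Run Whisper on a local file and return a timestamped transcript."""
    # Imported lazily so the formatting helpers work without the package.
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)  # larger models are slower
    result = model.transcribe(path)
    return format_segments(result["segments"])
```

Calling `transcribe_locally("interview.mp3")` downloads the model on first use. Speed depends heavily on model size and whether a GPU is available.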
Specific to MP3
MP3 is a format, not a recording style — which means the failure modes come from the encoder, not the speech.
Defaults that fit ~80% of MP3 files. Override per-job from the form.
Accuracy · real-world numbers
MP3 accuracy is bounded by what the encoder kept, not by us. Perceptual compression above ~96 kbps preserves speech intelligibility very well; below 64 kbps, sibilants and other high-frequency consonants start dissolving. Numbers below are from real customer MP3s in production.
Near-lossless for speech. Podcast masters, dictation app exports, professional interview rigs. Diarization is clean if speakers are on separate channels.
Most common bitrate for spoken-word MP3s. Zoom exports, Riverside downloads, voice-recorder defaults. Compression artifacts are inaudible to the recognizer.
Voice memo defaults on most phones. Acoustic diarization handles 2-4 speakers. Numbers and proper nouns occasionally need a glance.
Old answering-machine rips, lecture archives, narrow-band sources. High-frequency consonants (f/s/sh) blur. Still legible — plan a proofread.
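As a rough rule of thumb, the two thresholds quoted above (~96 kbps and 64 kbps) can be turned into a pre-flight check. This is only a sketch: the function name and band labels are hypothetical, and treating exactly 96 kbps as "well preserved" is an assumption, since the text gives only approximate cut-offs.

```python
def speech_quality_band(bitrate_kbps: int) -> str:
    """Classify an MP3 bitrate using the ~96 kbps and 64 kbps thresholds."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate must be positive")
    if bitrate_kbps >= 96:
        # Perceptual compression preserves speech intelligibility very well.
        return "well preserved"
    if bitrate_kbps >= 64:
        # The text does not characterize this range in detail.
        return "in between"
    # Below 64 kbps, high-frequency consonants (f/s/sh) start to blur.
    return "degraded: plan a proofread"


for kbps in (128, 80, 32):
    print(kbps, speech_quality_band(kbps))
```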
Common questions
30 free minutes every month. No card required. Speaker labels, 99 languages, every export format included.
Start free