Transcribe MP3 to text. Speaker labels, 99 languages.

Drop an MP3 file at any bitrate from 64 to 320 kbps. Get a timestamped, speaker-labeled transcript in 99 languages — no format conversion, no re-encoding, no waiting on a queue.

Drop a file, or pick one

MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously

Paste a link, we’ll fetch the audio

YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more

Record straight from your browser

Sign-up takes 30 seconds, and recording opens right after in the dashboard.

No card required
~90s per 60-min file
SRT · VTT · DOCX · TXT
Files auto-deleted in 24h

↓ Watch what comes out

MP3 in. Diarized transcript out.

We read the MP3 frame headers directly — VBR, CBR, joint-stereo, any encoder (LAME, Fraunhofer, FFmpeg). If the file is true stereo with speakers on separate channels, we use that to split voices. Mono mix-down falls back to acoustic diarization.

interview-tape-04.mp3 · REC · 192 kbps · stereo · 38:42
auto-detected en-GB · 44.1 kHz · LAME 3.100
~90s
Transcript · streaming · 95% accuracy
S1: So when did you first realise the archive was incomplete?

S2: Probably around 2019, when we started digitising the reel-to-reels.

S1: And the missing tapes, were they catalogued anywhere at all?

S2: There's a paper index from '78, but half of it's water-damaged.

95% on 192 kbps stereo
SRT · DOCX · TXT · JSON · VTT

↓ This is the dashboard

This is what loads when the job finishes.

Same layout as the real dashboard — Summary, full Transcript, Speakers tab, Exports. Key points and action items extracted automatically. Auto-tags on every job.

Try it on your own file — it's free

Three real options · honest comparison

Free local Whisper. Otter or Sonix. Or us.

You can run Whisper on your own laptop for free if you're technical. Otter and Sonix accept MP3 uploads inside subscription dashboards. We take the file, return the transcript, and don't make you live inside a UI.

Option 01

Whisper local / open source

Free if you have a GPU and an afternoon. No speaker diarization out of the box.

Setup: Python + CUDA + 10 GB models
Speaker diarization: Not included (pyannote add-on)
Speed · 1 hr MP3: 5–40 min on consumer GPU
Languages: 99, but tiny model drops below 80%
Export: TXT / SRT / VTT / JSON
Cost: Free + your electricity
Best for: Engineers who already own a GPU, don't need speaker labels, and want full local privacy.

Option 02

Transcription.Solutions

Drop the MP3. Get speaker-labeled text back in roughly 0.025× real time (about 90 seconds per hour of audio).

Setup: Drag-and-drop, no account needed to try
Speaker diarization: Built in (Pro & Business plans)
Speed · 1 hr MP3: ~90 seconds
Languages: 99, auto-detected
Export: SRT · VTT · DOCX · TXT · JSON
Cost · per min: $0.03
Best for: Anyone with an MP3 (journalist tape, podcast export, voice memo, archival dub) who just wants accurate text out the other end.

Option 03

Otter / Sonix

Polished dashboard, monthly minutes cap, English-tuned. File upload feels like a side feature.

Setup: Account + paid plan
Speaker diarization: Acoustic, EN-leaning
Speed · 1 hr MP3: 5–10 min in queue
Languages: Otter EN-only; Sonix ~40
Export: Locked behind paid tiers
Cost: $17+/mo or $10+/hr (Sonix)
Best for: Teams that want a transcript editor and collaboration UI more than a clean API-style file→text flow.

Pricing and feature availability accurate as of May 2026. Whisper performance varies by model size and hardware.

Specific to MP3

Three things that bite people on generic transcription tools.

MP3 is a format, not a recording style — which means the failure modes come from the encoder, not the speech.

What goes wrong

  1. VBR headers get mis-parsed. Some tools read variable-bitrate MP3s as fixed-rate and miscalculate duration, so timestamps drift by minutes over an hour-long file.
  2. Joint-stereo gets flattened to mono during upload preprocessing. You lose the per-speaker channel separation that was actually in the file.
  3. Embedded ID3 album art trips a few uploaders: they reject the file as 'not pure audio' or strip it and re-encode, dropping quality further.

What we do instead

  1. We use the Xing/LAME header when present and fall back to counting frames when not. VBR timestamps stay accurate to ±0.1 s across multi-hour files.
  2. Joint-stereo and true-stereo MP3s are decoded to L/R PCM before diarization. If your speakers were panned, we keep them split.
  3. ID3v1, ID3v2, APE tags, embedded art: all passed through untouched. We never re-encode your MP3.
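
To see why reading a VBR file as fixed-rate causes drift, here is a minimal Python sketch (an illustration, not our production decoder). It assumes MPEG-1 Layer III, where every frame carries 1152 samples, and that the frame count has already been read from the Xing/LAME header or obtained by counting frames. The function names and the example file are hypothetical.

```python
SAMPLES_PER_FRAME = 1152  # MPEG-1 Layer III; MPEG-2/2.5 Layer III uses 576


def duration_from_frames(frame_count: int, sample_rate_hz: int) -> float:
    """Duration in seconds from a frame count (Xing header or a full frame scan)."""
    return frame_count * SAMPLES_PER_FRAME / sample_rate_hz


def duration_assuming_cbr(audio_bytes: int, first_frame_kbps: int) -> float:
    """Naive estimate that treats the first frame's bitrate as constant."""
    return audio_bytes * 8 / (first_frame_kbps * 1000)


# Hypothetical one-hour VBR file at 44.1 kHz: it averages 160 kbps,
# but its first frame happens to be encoded at 128 kbps.
frames = round(3600 * 44100 / SAMPLES_PER_FRAME)
audio_bytes = 3600 * 160_000 // 8

exact = duration_from_frames(frames, 44100)
naive = duration_assuming_cbr(audio_bytes, 128)
print(f"frame-based: {exact:.1f} s")
print(f"CBR guess:   {naive:.1f} s")
print(f"drift:       {naive - exact:.1f} s")
```

In this made-up case the fixed-rate guess comes out at 4500 seconds for a one-hour file, so timestamps near the end would be off by about fifteen minutes. The frame-based figure depends only on the frame count and sample rate, which is why it stays stable however the bitrate varies.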

Recommended job settings for MP3 uploads

Defaults that fit ~80% of MP3 files. Override per-job from the form.

Decoder: Frame-accurate, no re-encode
Diarization: Channel split if stereo, else acoustic
Speaker model: Auto · 1–12 speakers
Language: Auto-detect from first 30 s
Filler words: Removed (toggle to keep)
Export bundle: DOCX + SRT + timestamped TXT
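
As a rough illustration of the "channel split if stereo, else acoustic" default, the Python sketch below picks a strategy from decoded PCM samples. It is an assumption about how such a check could work, not a description of our actual pipeline: the function name, the 0.98 threshold, and the use of Pearson correlation to spot stereo files that carry the same audio on both sides are all hypothetical.

```python
import math
from typing import Optional, Sequence


def pick_diarization(left: Sequence[float],
                     right: Optional[Sequence[float]] = None,
                     dual_mono_threshold: float = 0.98) -> str:
    """Return 'channel_split' or 'acoustic' for one decoded file.

    Hypothetical rule: a stereo file whose channels are almost perfectly
    correlated is treated as dual mono and handled like a mono file.
    """
    if right is None or len(left) != len(right) or len(left) < 2:
        return "acoustic"

    n = len(left)
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)

    if var_l == 0 or var_r == 0:
        # One channel is silent or constant, so there is nothing to separate.
        return "acoustic"

    correlation = cov / math.sqrt(var_l * var_r)
    return "acoustic" if correlation >= dual_mono_threshold else "channel_split"


# Synthetic stand-ins for decoded channels.
voice_a = [math.sin(i / 10) for i in range(1000)]
voice_b = [math.sin(i / 7 + 1.0) for i in range(1000)]

print(pick_diarization(voice_a, voice_a))  # identical channels -> acoustic
print(pick_diarization(voice_a, voice_b))  # distinct channels  -> channel_split
print(pick_diarization(voice_a))           # mono               -> acoustic
```

Real recordings have crosstalk between microphones, so a practical threshold would need tuning on actual files rather than synthetic tones.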

Accuracy · real-world numbers

95%+ on 192 kbps stereo. Usable down to 64 kbps mono.

MP3 accuracy is bounded by what the encoder kept, not by us. Perceptual compression above ~96 kbps preserves speech intelligibility very well; below 64 kbps, sibilants and consonants start dissolving. Numbers below are from real customer MP3s in production.

96%
320 kbps stereo, studio source

Near-lossless for speech. Podcast masters, dictation app exports, professional interview rigs. Diarization clean if speakers on separate channels.

95%
192 kbps stereo, 2-3 speakers

Most common bitrate for spoken-word MP3s: Zoom exports, Riverside downloads, voice-recorder defaults. Compression artifacts are inaudible to the recognizer.

91%
128 kbps mono, conversational

Voice memo defaults on most phones. Acoustic diarization handles 2-4 speakers. Numbers and proper nouns occasionally need a glance.

84%
64 kbps mono, archival / phone-dump

Old answering-machine rips, lecture archives, narrow-band sources. High-frequency consonants (f/s/sh) blur. Still legible — plan a proofread.
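
If you want to check figures like these on your own files, one common convention is accuracy = 1 − word error rate (WER), where WER is the word-level edit distance between a trusted reference transcript and the recognizer output, divided by the number of reference words. That convention is an assumption here, not a statement of how the numbers above were measured. A minimal Python sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("probably around 2019 maybe", "probably around 2090 maybe")
print(f"WER {wer:.2f}, accuracy {1 - wer:.0%}")  # WER 0.25, accuracy 75%
```

This version only lowercases and splits on whitespace. Punctuation and number formatting ("2019" versus "twenty nineteen") count as errors unless you normalize both transcripts first.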

Common questions

8 things people ask about MP3 transcription.

01. What's the minimum MP3 bitrate that still gives a usable transcript?
64 kbps is the practical floor. Below that, sibilants (s, sh, f) compress into noise and word error rate climbs past 20%. If you're recording fresh, target 128 kbps mono or 192 kbps stereo — anything higher is overkill for speech.
02. Do I need to convert my MP3 to WAV first?
No. Converting MP3 → WAV adds zero accuracy because the data the encoder discarded is gone for good. Upload the MP3 directly. We decode frames in memory and feed PCM to the recognizer.
03. Will stereo MP3 give me better speaker labels than mono?
Only if the speakers were actually recorded on separate channels — most stereo MP3s have the same audio on both sides ('dual mono') and gain nothing. True channel-split (e.g. Riverside exports, two-mic field rigs) lets us skip acoustic diarization and label speakers near-perfectly.
04. What's the maximum MP3 file size you accept?
5 GB per upload, which is roughly 60 hours at 192 kbps or 90 hours at 128 kbps. If your file is larger we'll show a chunked upload — no need to split it yourself.
05. How long does a 60-minute MP3 take to transcribe?
Typically 90 seconds from upload-complete to transcript-ready, regardless of bitrate. Decoding MP3 frames is fast; the time is in the recognizer. Diarization adds 5-10 seconds on multi-speaker files.
06. My MP3 has background music. Will the transcript be ruined?
Quiet bed music under speech is fine. Loud music that competes with the voice (intro stings, scoring under interviews) sometimes triggers misrecognitions on overlapping syllables. Toggle music suppression on the job form to pre-filter.
07. Can you handle MP3s ripped from phone voicemail or answering machines?
Yes, though these are often 8 kHz narrow-band re-encoded as MP3 — the audio quality ceiling is set by the original PSTN capture, not the MP3 wrapper. Expect 78-85% accuracy on that kind of source, which is the same we'd get on the underlying call.
08. Do you keep my MP3 after the transcript is done?
Files are deleted after 30 days by default, or immediately on request via the dashboard. The transcript stays in your account until you delete it. We don't use customer audio to train any model — ever.
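
The size and turnaround figures in questions 04 and 05 follow from simple arithmetic. The Python sketch below assumes constant bitrate, decimal gigabytes (1 GB = 10^9 bytes), no tag overhead, and the 90 seconds per 60 minutes of audio quoted above. The function names are illustrative.

```python
def hours_that_fit(limit_bytes: int, bitrate_kbps: int) -> float:
    """Hours of constant-bitrate audio that fit in a given upload limit."""
    seconds = limit_bytes * 8 / (bitrate_kbps * 1000)
    return seconds / 3600


def estimated_turnaround_s(audio_minutes: float,
                           seconds_per_hour_of_audio: float = 90.0) -> float:
    """Rough processing time, scaled linearly from the 90 s per hour figure."""
    return audio_minutes / 60 * seconds_per_hour_of_audio


LIMIT_BYTES = 5 * 10**9
for kbps in (192, 128, 64):
    print(f"{kbps} kbps: about {hours_that_fit(LIMIT_BYTES, kbps):.0f} h fit in 5 GB")

for minutes in (20, 60):
    print(f"{minutes}-min file: about {estimated_turnaround_s(minutes):.0f} s")
```

Under these assumptions 5 GB holds about 58 hours at 192 kbps and about 87 hours at 128 kbps, in line with the rounded figures in question 04. The turnaround estimate is linear and leaves out the extra 5-10 seconds that diarization adds on multi-speaker files.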

Drop your MP3. Get text back in 90 seconds.

30 free minutes every month. No card required. Speaker labels, 99 languages, every export format included.

Start free