Transcribe MP4 video to text. Audio extracted automatically.

Drop the MP4 file as-is — we pull the audio track server-side, return a timestamped transcript, and ship an SRT that drops straight back into YouTube, Vimeo, or your NLE.

Drop a file, or pick one

MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously

Paste a link, we’ll fetch the audio

YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more

Record straight from your browser

Sign up takes 30 seconds — recording opens right after, in the dashboard.

No card required · ~90s per 60-min file · SRT · VTT · DOCX · TXT · Files auto-deleted in 24h

↓ Watch what comes out

MP4 in. Transcript + SRT out.

MP4 is a container — we read the audio stream directly, never re-encode the video. Timestamps stay frame-accurate to your original timeline, so the SRT lines up on first import.

training-module-04.mp4 · REC 1080p · 22:14 · 412 MB
auto-detected en-US · AAC 48 kHz stereo · 192 kbps
~90s
Transcript · streaming · 95% accuracy
S1

Alright, in this module we're walking through the refund workflow end-to-end.

S2

Quick question before we start — does this apply to partial refunds too?

S1

Good catch. Partials use the same screen but a different reason code.

S2

Got it. And the approval threshold is still two hundred dollars?

95% on clean dialog · SRT · VTT · DOCX · TXT · JSON

↓ This is the dashboard

This is what loads when the job finishes.

Same layout as the real dashboard — Summary, full Transcript, Speakers tab, Exports. Key points and action items extracted automatically. Auto-tags on every job.

Try it on your own file — it's free

Three real options · honest comparison

DIY with ffmpeg. A video editor. Or us.

You can extract the audio yourself and run Whisper. You can drag the MP4 into Descript or VEED and live inside their editor. Or you can drop the file here and get the transcript + SRT back, no editor lock-in.

Option 01

ffmpeg + Whisper

Free, local, fiddly. You own the pipeline and every bug in it.

Requires: CLI + 10 GB model + GPU
Speaker diarization: Separate tool (pyannote)
SRT output: Yes, manual flag
Time on a 1-hour MP4: 20–90 min on CPU
Multi-track audio: You pick the stream
Cost: $0 + your hardware
Best for: Engineers who already run Whisper locally and don't mind stitching diarization on top.
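
For the DIY route, the audio-extraction step might look like the sketch below. It only builds an ffmpeg argument list. `build_extract_cmd` and its parameters are hypothetical names for illustration, and the 16 kHz mono WAV target is an assumption about what a speech model typically wants, not something this page specifies.

```python
import shlex


def build_extract_cmd(src: str, dst: str, audio_index: int = 0,
                      copy: bool = False) -> list[str]:
    """Build an ffmpeg command that pulls one audio stream out of a container.

    The video stream is dropped with -vn, so it is never re-encoded.
    """
    # -map 0:a:N selects the N-th audio stream of the first input.
    cmd = ["ffmpeg", "-i", src, "-map", f"0:a:{audio_index}", "-vn"]
    if copy:
        # Stream copy: keep the original codec and sample rate untouched.
        cmd += ["-c:a", "copy"]
    else:
        # Decode to 16 kHz mono 16-bit PCM (assumed target for the recognizer).
        cmd += ["-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le"]
    cmd.append(dst)
    return cmd


if __name__ == "__main__":
    cmd = build_extract_cmd("training-module-04.mp4", "audio.wav")
    # Print the shell form; pass `cmd` to subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

Returning a list rather than a shell string keeps file names with spaces safe when the command is handed to `subprocess.run`.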
Option 02

Transcription.Solutions

Drop the MP4. Audio extraction, diarization, SRT, summary — one pass.

Requires: Browser, that's it
Speaker diarization: Built in, every job
SRT output: Frame-aligned to source
Time on a 1-hour MP4: ~4 min, streamed
Multi-track audio: We list all streams
Cost per min: $0.03
Best for: Anyone with an MP4 who wants the text and SRT without learning a video editor or a CLI.
Option 03

Descript / VEED

Load MP4 into the editor. Transcript appears as part of the timeline UI.

Requires: Account + editor learning curve
Speaker diarization: Yes, EN-tuned
SRT output: Export-gated by plan
Upload cap: 5 GB (Descript free)
Multi-track audio: First track only
Cost: $12–24/user/mo
Best for: Editors who want to cut the video and the transcript in the same tool.

Pricing and feature caps approximate as of 2026. Descript and VEED tier names change frequently — check their site for current limits.

Specific to MP4

Three things that bite people on generic transcription tools.

MP4 is a container, not a codec — and most transcription tools treat it like one big audio blob. That's where the misses come from.

What goes wrong

  1. Multi-track MP4 with boom + lav. Generic tools grab track 1 and ignore the rest, so you lose the cleaner mic. Common on FCP and Premiere exports.
  2. Background music in vlogs and ads triggers phantom words. The recognizer tries to transcribe the vocals on the music bed.
  3. SRT timestamps drift when the tool re-encodes the video on the way in. By minute 40 the captions are a second off.

What to flip here

  1. Upload. We probe every audio stream and let you pick which one transcribes. Default is the highest-bitrate track.
  2. Turn on Music suppression on the job form. We gate the recognizer on speech VAD so instrumental sections stay empty.
  3. We never re-encode video. Audio is extracted at native sample rate and timestamps reference the container's edit list, so the SRT lines up frame-accurate.
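
The "highest-bitrate track" default from step 1 can be approximated locally with ffprobe. This is a minimal sketch, not the service's actual implementation: `pick_default_audio_stream` is a hypothetical helper, and it assumes the input is the JSON produced by `ffprobe -v error -show_streams -of json input.mp4`.

```python
import json


def _bit_rate(stream: dict) -> int:
    """ffprobe reports bit_rate as a string and may omit it; treat that as 0."""
    try:
        return int(stream.get("bit_rate", 0))
    except (TypeError, ValueError):
        return 0


def pick_default_audio_stream(ffprobe_json: str) -> dict:
    """Return the audio stream with the highest reported bitrate."""
    streams = json.loads(ffprobe_json).get("streams", [])
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    if not audio:
        raise ValueError("no audio streams found")
    # max() keeps the first stream on ties, i.e. the lowest stream index.
    return max(audio, key=_bit_rate)


if __name__ == "__main__":
    # Hand-written sample shaped like ffprobe output for a boom + lav export.
    sample = json.dumps({"streams": [
        {"index": 0, "codec_type": "video", "codec_name": "h264"},
        {"index": 1, "codec_type": "audio", "codec_name": "aac", "bit_rate": "128000"},
        {"index": 2, "codec_type": "audio", "codec_name": "aac", "bit_rate": "256000"},
    ]})
    print(pick_default_audio_stream(sample)["index"])
```

Bitrate is only a proxy for quality. On a boom + lav export the cleaner mic may sit on the lower-bitrate track, which is why a manual stream picker still matters.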

Recommended job settings for MP4

Drop an MP4 and these flip on by default. Override per-job from the form.

Audio extraction: Native sample rate, no re-encode
Track selection: Highest-bitrate stream
Diarization: Acoustic · 1-6 speakers
Music suppression: On for vlog/ad presets
SRT format: ≤42 chars/line, 2 lines max
Export: SRT · VTT · DOCX · timestamped TXT
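
To make the SRT format setting concrete, here is a rough sketch of how timed segments could be turned into cues of at most 42 characters per line and 2 lines each. The function names are hypothetical, and splitting an over-long segment's time span in proportion to character count is an assumption for illustration, not a description of how the service times its cues.

```python
import textwrap


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments, max_chars: int = 42, max_lines: int = 2) -> str:
    """Render (start_sec, end_sec, text) segments as SRT text."""
    cues = []
    for start, end, text in segments:
        lines = textwrap.wrap(text, width=max_chars)
        # Group wrapped lines into blocks of at most max_lines lines.
        blocks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
        total = sum(len(line) for line in lines) or 1
        t = start
        for block in blocks:
            # Give each block a share of the span proportional to its characters.
            share = sum(len(line) for line in block) / total
            block_end = t + (end - start) * share
            cues.append((t, block_end, block))
            t = block_end
    out = []
    for n, (s, e, block) in enumerate(cues, 1):
        out.append(f"{n}\n{srt_timestamp(s)} --> {srt_timestamp(e)}\n"
                   + "\n".join(block) + "\n")
    return "\n".join(out)


if __name__ == "__main__":
    print(to_srt([
        (0.0, 4.0, "Alright, in this module we're walking through "
                   "the refund workflow end-to-end."),
    ]))
```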

Accuracy · real-world numbers

95% on a clean shoot. Honest numbers when the audio fights back.

MP4 accuracy is set by the mic, not the codec. A lav mic on a quiet set beats a 4K camera with on-board audio every time. Numbers below come from real customer MP4s, sorted by what was capturing the audio.

96%+
Studio shoot, lav or shotgun mic

Lapel or boom into a recorder, 48 kHz AAC at 192+ kbps, treated room. The ceiling case. Speaker labels nail it on a two-person shoot.

93%
DSLR with on-camera shotgun

Camera-top mic 2-4 feet from speaker. Some room tone but speech is intelligible. Most YouTube creator footage lands here.

89%
Screen recording with USB mic

OBS, Loom, Camtasia exports. Mic is close but room is untreated, often with system audio bleed. Plenty good for tutorial transcripts.

84%
Phone-shot vlog, internal mic

Built-in phone mic, wind or handling noise, distance varies shot to shot. Words usable, expect 1-2 fixes per minute on proper nouns.
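
The page does not say how these percentages are measured. If you want to check a transcript of your own against a hand-corrected reference, one common yardstick is word error rate (WER), with accuracy read as roughly 1 minus WER. The sketch below assumes that definition; `word_error_rate` is a hypothetical helper and its normalization (lowercasing and whitespace splitting only) is deliberately simple.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # reference word missing from hypothesis
                cur[j - 1] + 1,          # extra word in hypothesis
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the approval threshold is still two hundred dollars"
    hyp = "the approval threshold is still to hundred dollars"
    wer = word_error_rate(ref, hyp)
    print(f"WER {wer:.3f}, accuracy about {1 - wer:.1%}")
```

Punctuation and number formatting ("200" versus "two hundred") count as errors under this simple normalization, so strip or normalize them first if you want figures comparable to published ones.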

Common questions

8 things people ask about MP4 transcription.

01. Do you re-encode my video?
No. We only read the audio stream out of the MP4 container. The video stream is never touched, never re-encoded, and never stored after the job finishes — you keep your original file unchanged.
02. What codecs inside the MP4 are supported?
Standard H.264 + AAC is the easy case. We also handle HEVC/H.265, ProRes-in-MP4, and audio in MP3, Opus, ALAC, or PCM. If ffmpeg can probe it, we can transcribe it.
03. What's the file size cap?
10 GB per upload on the web uploader, 50 GB via the API with resumable chunks. A typical 1-hour 1080p MP4 is 1-3 GB so most files fit the web path without thinking about it.
04. Will the SRT line up with my original video?
Yes — timestamps reference the MP4's edit list and native sample rate. We don't re-encode, so there's no drift. Drop the SRT next to the MP4 in any player or NLE and captions sync on first load.
05. Can I burn the subtitles into the video?
Not on our side — we output the SRT and leave the burn-in to your editor. ffmpeg one-liner, HandBrake, Premiere, DaVinci, Kapwing all accept the SRT we produce. We don't want to be the encoding tool too.
06. What about MOV, MKV, M4V, WebM?
All supported through the same pipeline. MOV especially: the MP4 container format is derived from QuickTime's, so the extraction path is identical. MKV with multiple audio tracks gets the same stream-picker UI as a multi-track MP4.
07. Can I just send a YouTube or Vimeo URL?
Yes for YouTube — paste a public URL on the upload screen and we fetch the audio directly, no MP4 download needed. Vimeo requires a direct file or a signed download link because their player gates the stream.
08. What if there's no spoken dialog, just music or B-roll?
VAD detects silent and music-only sections and skips them, so you don't pay for ambient footage. The transcript marks those ranges as `[music]` or `[no speech]` instead of inventing words.
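
As an illustration of the skipped ranges described in the last answer: given speech intervals from some voice-activity detector, the stretches to mark are simply the gaps between them. This is a sketch under assumptions. `non_speech_ranges` and the 1-second minimum gap are hypothetical, and telling `[music]` apart from `[no speech]` would need a separate classifier that is not shown here.

```python
def non_speech_ranges(speech, duration: float, min_gap: float = 1.0):
    """Return (start, end) ranges not covered by any speech interval.

    speech:   iterable of (start_sec, end_sec) pairs, possibly overlapping
    duration: total length of the audio in seconds
    min_gap:  gaps shorter than this are ignored as natural pauses
    """
    gaps = []
    cursor = 0.0  # end of the speech seen so far
    for start, end in sorted(speech):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        # max() handles intervals nested inside or overlapping earlier ones.
        cursor = max(cursor, end)
    if duration - cursor >= min_gap:
        gaps.append((cursor, duration))
    return gaps


if __name__ == "__main__":
    speech = [(5.0, 20.0), (18.0, 30.0), (45.0, 60.0)]
    for start, end in non_speech_ranges(speech, duration=70.0):
        print(f"{start:6.1f} - {end:6.1f}  [no speech]")
```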

Drop your MP4. Get the transcript and SRT back.

30 free minutes every month. No card. Audio extracted server-side, speaker labels, frame-accurate SRT — all included.

Start free