Drop the MP4 file as-is — we pull the audio track server-side, return a timestamped transcript, and ship an SRT that drops straight back into YouTube, Vimeo, or your NLE.
MP3 · WAV · M4A · MP4 · MOV · MKV · OGG · OPUS · FLAC · WEBM — up to 100 MB anonymously
YouTube · TikTok · Vimeo · Twitter · SoundCloud · Spotify · 50+ more
↓ Watch what comes out
MP4 is a container — we read the audio stream directly, never re-encode the video. Timestamps stay frame-accurate to your original timeline, so the SRT lines up on first import.
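For readers curious what "read the audio stream directly" can look like in practice, here is a minimal Python sketch that builds an ffmpeg command to copy the audio stream out of an MP4 without touching the video. It is an illustration, not our server code: the function name is hypothetical, and the .m4a output assumes the audio track is AAC, which is common in MP4 but not guaranteed.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_audio_copy_cmd(src: str, dst: Optional[str] = None) -> List[str]:
    """Build an ffmpeg command that stream-copies the first audio track.

    Hypothetical helper for illustration only.
    """
    src_path = Path(src)
    # Assumption: the audio is AAC, so an .m4a wrapper is a suitable target.
    dst_path = Path(dst) if dst else src_path.with_suffix(".m4a")
    return [
        "ffmpeg",
        "-i", str(src_path),
        "-vn",            # drop the video stream from the output
        "-map", "0:a:0",  # select the first audio stream of the input
        "-c:a", "copy",   # stream copy: the audio is not re-encoded
        str(dst_path),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_audio_copy_cmd("lesson.mp4")))
```

To actually run it, pass the returned list to subprocess.run with check=True on a machine that has ffmpeg installed. Because the audio is copied rather than re-encoded, its original timing is preserved.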
Alright, in this module we're walking through the refund workflow end-to-end.
Quick question before we start — does this apply to partial refunds too?
Good catch. Partials use the same screen but a different reason code.
Got it. And the approval threshold is still two hundred dollars?
↓ This is the dashboard
Same layout as the real dashboard — Summary, full Transcript, Speakers tab, Exports. Key points and action items extracted automatically. Auto-tags on every job.
Sample preview from a founder interview about post-call workflow. Real transcripts look exactly like this — same tabs, same summary block, same key-points / action-items split, same auto-tag chips.
Three real options · honest comparison
You can extract the audio yourself and run Whisper. You can drag the MP4 into Descript or VEED and live inside their editor. Or you can drop the file here and get the transcript + SRT back, no editor lock-in.
ffmpeg + Whisper: Free, local, fiddly. You own the pipeline and every bug in it.
Upload here: Drop the MP4. Audio extraction, diarization, SRT, and summary in one pass.
Descript or VEED: Load the MP4 into the editor. The transcript appears as part of the timeline UI.
Pricing and feature caps approximate as of 2026. Descript and VEED tier names change frequently — check their site for current limits.
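If you want to try the do-it-yourself route, a minimal Python sketch might look like the following. It assumes the open-source openai-whisper package is installed and that ffmpeg is on your PATH. The helper names are hypothetical, and the "base" model is only a placeholder choice. Plain Whisper does not label speakers, so diarization would be a separate step.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Render segments (dicts with 'start', 'end', 'text') as SRT cues."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


def transcribe_to_srt(media_path: str, model_name: str = "base") -> str:
    """Run Whisper locally and return the transcript as SRT text."""
    # Imported lazily so the formatting helpers work without Whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(media_path)
    return segments_to_srt(result["segments"])


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Alright, in this module..."},
        {"start": 2.5, "end": 5.0, "text": " Quick question before we start."},
    ]
    print(segments_to_srt(demo))
```

Calling transcribe_to_srt("lesson.mp4") and writing the result to lesson.srt gives you a subtitle file to import. Model choice, speaker labels, and error handling are the parts you would still need to own.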
Specific to MP4
MP4 is a container, not a codec — and most transcription tools treat it like one big audio blob. That's where the misses come from.
Drop an MP4 and these flip on by default. Override per-job from the form.
Accuracy · real-world numbers
MP4 accuracy is set by the mic, not the codec. A lav mic on a quiet set beats a 4K camera with on-board audio every time. Numbers below come from real customer MP4s, sorted by what was capturing the audio.
Lapel or boom into a recorder, 48 kHz AAC at 192+ kbps, treated room. The ceiling case. Speaker labels nail it on a two-person shoot.
Camera-top mic 2-4 feet from speaker. Some room tone but speech is intelligible. Most YouTube creator footage lands here.
OBS, Loom, Camtasia exports. Mic is close but room is untreated, often with system audio bleed. Plenty good for tutorial transcripts.
Built-in phone mic, wind or handling noise, distance varies shot to shot. Words usable, expect 1-2 fixes per minute on proper nouns.
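To check whether a file's audio track matches the encoding named in the ceiling case above (48 kHz AAC at 192 kbps or more), you could inspect it with ffprobe. The sketch below is a hedged illustration with hypothetical helper names. It only looks at the encoding, not the mic or the room, which matter more for accuracy.

```python
import json
from typing import Dict, List, Optional


def ffprobe_cmd(path: str) -> List[str]:
    """Build an ffprobe command that reports the first audio stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_name,sample_rate,bit_rate",
        "-of", "json",
        path,
    ]


def parse_audio_info(ffprobe_json: str) -> Optional[Dict]:
    """Extract codec, sample rate, and bit rate from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return None  # no audio stream found
    stream = streams[0]
    # ffprobe reports numeric fields as strings; bit_rate may be absent.
    rate = stream.get("sample_rate")
    bit_rate = stream.get("bit_rate")
    return {
        "codec": stream.get("codec_name"),
        "sample_rate": int(rate) if rate and rate.isdigit() else None,
        "bit_rate": int(bit_rate) if bit_rate and bit_rate.isdigit() else None,
    }


def is_ceiling_case_encoding(info: Optional[Dict]) -> bool:
    """True when the audio is 48 kHz AAC at 192 kbps or higher."""
    if not info:
        return False
    return (
        info["codec"] == "aac"
        and info["sample_rate"] == 48_000
        and (info["bit_rate"] or 0) >= 192_000
    )


if __name__ == "__main__":
    sample = '{"streams": [{"codec_name": "aac", "sample_rate": "48000", "bit_rate": "192000"}]}'
    print(is_ceiling_case_encoding(parse_audio_info(sample)))
```

In real use, run ffprobe_cmd(path) through subprocess.run with capture_output=True and feed its stdout to parse_audio_info.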
Common questions
30 free minutes every month. No card. Audio extracted server-side, speaker labels, frame-accurate SRT — all included.
Start free