Drop an MP4, MOV, or MKV — or paste a YouTube, TikTok, Vimeo, or Twitter URL. Get a clean transcript, SRT and VTT subtitle exports, optional AI summary, and timestamps you can quote. We handle the audio extraction.
Video-to-text turns the spoken audio in a video file into searchable, timestamped text. Transcription.Solutions handles the audio extraction for you: drop an MP4, MOV, MKV, AVI, WMV, or WEBM file, or paste a URL from YouTube, TikTok, Instagram Reels, Vimeo, Twitter, Facebook, Twitch, Reddit, or any of around 1,500 other sites that yt-dlp supports. Within minutes you get a clean transcript with speaker labels (on Pro and Business), an optional AI summary, and exports tailored to video: SRT and VTT subtitle files alongside DOCX and plain TXT.
Three steps from video to text. We extract the audio, transcribe it, and format the result for the export you actually need — subtitle file, transcript, or summary.
Drag in a file (MP4 / MOV / MKV / AVI / WMV / WEBM) of up to 2 GB on Business, 500 MB on Pro, or 100 MB on Free. Or paste a public URL: YouTube, TikTok, Vimeo, Twitter, Facebook, Twitch, Reddit, Dailymotion, BBC iPlayer, or a podcast feed. We extract the audio track on our servers; you don't need to convert anything.
We split long files into chunks, run speaker diarization on Pro and Business, and align word-level timestamps so subtitles cue accurately. Language is auto-detected. Music-heavy intros and outros are left out of the transcript when possible.
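The chunking step can be pictured with a short sketch. It assumes fixed-length windows with a small overlap at each seam; the 600-second window, the 5-second overlap, and the function name `chunk_windows` are illustrative choices, not the service's actual settings.

```python
from typing import List, Tuple


def chunk_windows(duration_s: float,
                  chunk_s: float = 600.0,
                  overlap_s: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the whole recording.

    Window length and overlap are assumed values for illustration only.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        # Step back slightly so words at the seam appear in both chunks
        # and can be de-duplicated when the transcripts are merged.
        start = end - overlap_s
    return windows
```

For a 25-minute file, `chunk_windows(1500)` yields three windows, with each new window starting 5 seconds before the previous one ends.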
SRT for video editors and YouTube uploads. VTT for HTML5 players. DOCX for sharing with reviewers. Plain TXT for downstream tooling. Optional AI summary with key points and timestamps for navigation.
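SRT and VTT are close relatives, which is why both can come from the same transcript. As a rough sketch of the difference: a VTT file starts with a `WEBVTT` header and uses a dot instead of a comma before the milliseconds. The helper name `srt_to_vtt` is hypothetical, and the sketch ignores VTT-only features such as styling and cue positioning.

```python
import re

# Matches an SRT timestamp such as 00:01:02,345
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to a basic WebVTT document."""
    lines = srt_text.replace("\r\n", "\n").strip().split("\n")
    out = ["WEBVTT", ""]
    for line in lines:
        if "-->" in line:
            # Only timing lines are rewritten, so commas in the
            # subtitle text itself are left alone.
            line = _SRT_TIME.sub(r"\1.\2", line)
        out.append(line)
    return "\n".join(out) + "\n"
```

The numeric cue counters from SRT are kept; WebVTT accepts them as cue identifiers.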
Word-level alignment so cues match what's actually being said. No 4-line wraps, no overlap. Optional CPS (characters per second) cap, 37 by default, for broadcast pacing.
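One way to picture how word-level timestamps become subtitle cues is the sketch below. It is not the service's algorithm: the `Word` and `Cue` types, the 84-character cue limit (two lines of 42), and the strategy of lengthening a cue to meet the CPS cap are all assumptions made for illustration. Line wrapping inside a cue is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Cue:
    start: float
    end: float
    text: str


def build_cues(words: List[Word], max_chars: int = 84,
               max_cps: float = 37.0) -> List[Cue]:
    """Group words into cues, then stretch short cues to respect the CPS cap."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for w in words:
        joined_len = len(" ".join(x.text for x in current + [w]))
        if current and joined_len > max_chars:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)

    cues: List[Cue] = []
    for i, group in enumerate(groups):
        text = " ".join(x.text for x in group)
        start, end = group[0].start, group[-1].end
        # Keep the cue on screen long enough to stay under the CPS cap...
        end = max(end, start + len(text) / max_cps)
        # ...but never past the start of the next cue, so cues do not overlap.
        if i + 1 < len(groups):
            end = min(end, groups[i + 1][0].start)
        cues.append(Cue(start, end, text))
    return cues


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(cues: List[Cue]) -> str:
    """Serialise cues as numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{n}\n{srt_timestamp(c.start)} --> {srt_timestamp(c.end)}\n{c.text}"
        for n, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"
```

When the next cue starts too soon for the cap to be met, this sketch prefers no overlap over strict pacing; a production system might instead re-split the text.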
Two or more voices separated and labelled (Speaker 1, Speaker 2). Manual rename per speaker. Available on Pro and Business, and essential for interview videos and panel discussions.
Sentence boundaries, capitalisation, punctuation. Filler words can be filtered on export. Timestamps every 30 seconds or per speaker turn.
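A rough sketch of the export options described above, filler filtering and a timestamp every 30 seconds, might look like this. The filler list, the `(start_seconds, token)` input shape, and the `[MM:SS]` marker format are assumptions for illustration; the actual export may differ.

```python
from typing import Iterable, List, Tuple

# Illustrative filler list; a real one would be language-specific.
FILLERS = {"um", "uh", "erm"}


def format_transcript(words: Iterable[Tuple[float, str]],
                      interval_s: float = 30.0,
                      drop_fillers: bool = True) -> str:
    """Join (start_seconds, token) pairs into lines that each begin with a marker."""
    lines: List[str] = []
    current: List[str] = []
    next_mark = 0.0
    for start, token in words:
        if drop_fillers and token.strip(".,!?;:").lower() in FILLERS:
            continue
        if start >= next_mark:
            # Close the previous line and open a new one at this boundary.
            if current:
                lines.append(" ".join(current))
                current = []
            mark = int(start // interval_s * interval_s)
            current.append(f"[{mark // 60:02d}:{mark % 60:02d}]")
            next_mark = mark + interval_s
        current.append(token)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)
```

Punctuation attached to a dropped filler is dropped with it, so a careful implementation would re-punctuate after filtering.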
Key points, decisions, and action items extracted from the transcript, each linked to the moment in the video where they happened.
Find a quote across hundreds of videos in seconds. Click the result and the player jumps to that moment.
POST a video URL or file and receive the transcript, SRT, and VTT via webhook. JWT auth, per-key rate limits. Available on every plan, including Free for evaluation.
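To give a feel for the request shape, here is a minimal sketch of submitting a job. Everything specific in it is hypothetical: the endpoint `https://api.example.com/v1/transcriptions`, the JSON field names `source_url`, `webhook_url`, and `formats`, and the assumption that the JWT travels as a Bearer token. Check the API reference for the real names.

```python
import json
import urllib.request

# Hypothetical endpoint, used only to make the sketch concrete.
API_URL = "https://api.example.com/v1/transcriptions"


def build_job_request(video_url: str, webhook_url: str,
                      jwt_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a video URL."""
    payload = {
        "source_url": video_url,          # hypothetical field name
        "webhook_url": webhook_url,       # hypothetical field name
        "formats": ["txt", "srt", "vtt"], # hypothetical field name
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt_token}",  # assumed JWT placement
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_job_request(...))`; the results would arrive later as a POST to the webhook URL you supplied.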
We resolve the URL, extract the audio, and transcribe it. Public videos only — anything behind a login (private accounts, member-only content, paid streaming services with DRM) won't resolve. If you can play the video in an incognito window without signing in, we can almost always transcribe it.
All formats below are accepted directly — we extract the audio track on our servers using ffmpeg. Maximum file size depends on plan: 100 MB Free, 500 MB Pro, 2 GB Business.
| No. | Format | Extension | Bitrate range | Notes |
|---|---|---|---|---|
| 01 | MP4 | .mp4 | any | The default for everything: phone recordings, screen captures, downloaded YouTube videos. H.264 / H.265 / AV1: we extract the audio regardless of the video codec. |
| 02 | MOV | .mov | any | QuickTime container, common for Mac screen recordings and pro cameras. Same audio handling as MP4. |
| 03 | MKV | .mkv | any | Open container popular for high-quality archives. Multiple audio tracks — we use the default track unless you tell us otherwise. |
| 04 | AVI / WMV | .avi / .wmv | any | Older Windows formats. Still common for archival footage. Accepted. |
| 05 | WEBM | .webm | any | Browser-recorded video (Google Meet exports, OBS web recordings). Direct upload, no re-encoding. |
| 06 | Audio-only | .mp3 / .wav / .m4a / .ogg / .opus / .flac | any | If you've already extracted the audio, upload it directly. See the audio-to-text reference for full details. |
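If you prefer to extract the audio yourself before uploading, an ffmpeg call along these lines works for all of the containers above. This is a sketch, not the command our servers run: the 16 kHz mono WAV target is an assumption (a common input for speech models), and `-map 0:a:N` picks the Nth audio stream by index, which is not necessarily the one flagged as default in an MKV.

```python
import subprocess
from typing import List


def ffmpeg_extract_cmd(src: str, dst: str, audio_track: int = 0,
                       sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that writes one audio stream as mono PCM WAV."""
    return [
        "ffmpeg", "-y",                 # overwrite dst if it exists
        "-i", src,
        "-map", f"0:a:{audio_track}",   # Nth audio stream of the first input
        "-vn",                          # drop video
        "-ac", "1",                     # downmix to mono
        "-ar", str(sample_rate),        # resample
        "-c:a", "pcm_s16le",            # 16-bit PCM
        dst,
    ]


def extract_audio(src: str, dst: str, audio_track: int = 0) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_extract_cmd(src, dst, audio_track), check=True)
```

For an MKV with a commentary track in second position, `extract_audio("talk.mkv", "talk.wav", audio_track=1)` would select it.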
99 languages with automatic detection. Studio-grade accuracy on the 8 cobalt-marked tier-1 languages, production-grade on tier-2, usable on tier-3. Each tile links to a dedicated landing page with benchmarks for that language.
Pick the scenario closest to yours. Each tile leads to a workflow guide with recommended export format, accuracy expectations, and the gotchas specific to that kind of video.
Video audio quality varies more than people expect: a YouTube tutorial filmed with a USB mic transcribes very differently from a phone-recorded street interview. Here's an honest account of what to expect.
USB mic or shotgun mic, one to two speakers, occasional background room tone. The result most podcast producers, YouTubers, and conference organisers see on their typical files.
Phone-recorded street interviews, multi-speaker panels, light background music in the intro. Most words are right; a single editorial pass catches the rest.
If a song's vocals are louder than the speaker, accuracy drops fast. We try to skip music-only intros and outros, but mid-clip music beds with overlapping vocals are hard. Solution: lower the music in your edit, or transcribe an alternate audio mix if you have one.
Three or more speakers talking over each other — typical of livestream debates and panel discussions — pushes accuracy below 90%. Diarization may merge two voices it can't separate. Plan a single editorial pass.
If a video contains call audio (someone joining a meeting from a phone), that segment will transcribe at phone-call accuracy — typically 80–85%. The rest of the video is unaffected.
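The percentages above are easier to interpret with a concrete measure in hand. This page does not state how accuracy is computed; the sketch below assumes the common convention of word accuracy, that is 1 minus word error rate (WER), where WER is the word-level edit distance divided by the number of reference words. The function name `word_accuracy` is illustrative.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, floored at 0, comparing lower-cased whitespace tokens."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    wer = prev[-1] / len(ref)
    return max(0.0, 1.0 - wer)
```

Under this measure, one wrong word in a ten-word sentence gives 90%, so a transcript at 80–85% has roughly one error every five to seven words.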
We don't burn subtitles into the video file. We deliver SRT and VTT files that you upload alongside the video (to YouTube, Vimeo, or your CMS). For burned-in subtitles, import our SRT into a video editor.
“I do podcast clips for LinkedIn. Used to spend an hour on each three-minute clip — the SRT, the punctuated post copy, the timestamps. Now I paste the YouTube URL, walk away, and come back to all three. The SRT alignment is the part I expected would be wrong; it isn't.”
60 free minutes per month, no card required. Paste a YouTube URL or upload an MP4 — first transcript and SRT in 5 minutes.
Start free