Reference: Video to text in plain English. What it does, what file types work, what to expect.
Subject: Video-to-Text Conversion
Sources: MP4, MOV, MKV, AVI + 1,500 URL sources
Output: Transcript · SRT · VTT · DOCX
Privacy: Source video deleted in 24h

Video to text.
With speaker labels and subtitle files.

Drop an MP4, MOV, or MKV — or paste a YouTube, TikTok, Vimeo, or Twitter URL. Get a clean transcript, SRT and VTT subtitle exports, optional AI summary, and timestamps you can quote. We handle the audio extraction.

No card required · Generates SRT + VTT · Speaker labels on Pro · Source deleted in 24 hours
Faster than realtime
A 60-min video typically completes in 9–11 minutes. Long files split into chunks and process in parallel.
Subtitle formats
SRT
Plus VTT for web players. Optional CPS (characters-per-second) caps for broadcast-grade pacing.
File size
2 GB
Upload limit on Business; 500 MB on Pro; 100 MB on Free. Or paste a public URL and we extract the audio server-side.
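The plan limits quoted on this page can be checked before an upload starts. The sketch below is illustrative only: the size and duration figures come from this page, while the names `PlanLimit`, `PLAN_LIMITS`, and `check_upload` are hypothetical, and treating "MB" as binary megabytes is an assumption.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes; the page does not specify the unit


@dataclass(frozen=True)
class PlanLimit:
    max_bytes: int
    max_minutes: int


# Figures taken from this page; the table layout itself is illustrative.
PLAN_LIMITS = {
    "free": PlanLimit(max_bytes=100 * MB, max_minutes=30),
    "pro": PlanLimit(max_bytes=500 * MB, max_minutes=60),
    "business": PlanLimit(max_bytes=2048 * MB, max_minutes=240),
}


def check_upload(plan: str, size_bytes: int, duration_minutes: float) -> List[str]:
    """Return a list of problems; an empty list means the file fits the plan."""
    limit = PLAN_LIMITS[plan.lower()]
    problems = []
    if size_bytes > limit.max_bytes:
        problems.append(f"file exceeds the {plan} size limit")
    if duration_minutes > limit.max_minutes:
        problems.append(f"video exceeds the {plan} duration limit")
    return problems
```

For example, `check_upload("free", 150 * MB, 45)` reports both a size and a duration problem, while the same file passes on Pro.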
Definition: Reference passage

Video-to-text turns the spoken audio in a video file into searchable, timestamped text. Transcription.Solutions handles the audio extraction for you, server-side: drop an MP4, MOV, MKV, AVI, WMV, or WEBM file, or paste a URL from YouTube, TikTok, Instagram Reels, Vimeo, Twitter, Facebook, Twitch, Reddit, or any of around 1,500 other sites yt-dlp supports. Within minutes you get a clean transcript with speaker labels (on Pro and Business), an optional AI summary, and exports tailored to video: SRT and VTT subtitle files alongside DOCX and plain TXT.

Workflow: Three-step procedure

How it works

Three steps from video to text. We extract the audio, transcribe it, and format the result for the export you actually need — subtitle file, transcript, or summary.

1. Upload the video or paste a URL

Drag in a file (MP4 / MOV / MKV / AVI / WMV / WEBM) up to 2 GB on Business, 500 MB on Pro, 100 MB on Free. Or paste a public URL — YouTube, TikTok, Vimeo, Twitter, Facebook, Twitch, Reddit, Dailymotion, BBC iPlayer, podcast feed. We extract the audio track on our servers; you don't need to convert anything.

2. Automatic transcription with timestamps

We split long files into chunks, run speaker diarization on Pro and Business, and align word-level timestamps so subtitles cue accurately. Language is auto-detected. Music-heavy intros and outros are left out of the transcript when possible.

3. Export SRT, VTT, transcript, or summary

SRT for video editors and YouTube uploads. VTT for HTML5 players. DOCX for sharing with reviewers. Plain TXT for downstream tooling. Optional AI summary with key points and timestamps for navigation.
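To make the difference between the two subtitle exports concrete, here is a minimal sketch of how timed cues are laid out in SRT and in WebVTT. It is not the service's exporter: the `Cue` tuple and the function names are hypothetical, and counting every character (spaces included) toward CPS is an assumption.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, caption text)


def format_timestamp(seconds: float, separator: str = ",") -> str:
    """Render seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{millis:03d}"


def to_srt(cues: List[Cue]) -> str:
    """SRT: numbered blocks, comma before milliseconds, blank line between blocks."""
    blocks = [
        f"{i}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
    return "\n".join(blocks)


def to_vtt(cues: List[Cue]) -> str:
    """WebVTT: 'WEBVTT' header line, dot before milliseconds, cue numbers omitted."""
    blocks = [
        f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n{text}\n"
        for start, end, text in cues
    ]
    return "WEBVTT\n\n" + "\n".join(blocks)


def chars_per_second(cue: Cue) -> float:
    """Reading speed of one cue; counting spaces as characters is an assumption."""
    start, end, text = cue
    return len(text) / (end - start)
```

A cue whose `chars_per_second` value exceeds the chosen cap (37 is the default cap mentioned on this page) is a candidate for splitting or for a longer display time.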

Output: 6 deliverable elements

What you get

01. SRT and VTT subtitle files

Word-level alignment so cues match what's actually being said. No 4-line wraps, no overlap. Optional CPS cap (37 by default) for broadcast pacing.

02. Speaker labels

Two or more voices separated and labelled (Speaker 1, Speaker 2). Manual rename per speaker. Available on Pro and Business — required for interview videos and panel discussions.

03. Clean punctuated transcript

Sentence boundaries, capitalisation, punctuation. Filler words can be filtered on export. Timestamps every 30 seconds or per speaker turn.

04. AI summary with timestamps

Key points, decisions, and action items extracted from the transcript, each linked to the moment in the video where they happened.

05. Searchable across your library

Find a quote across hundreds of videos in seconds. Click the result and the player jumps to that moment.

06. REST API for video pipelines

POST a video URL or file and receive the transcript, SRT, and VTT via webhook. JWT auth, per-key rate limits. Available on every plan including Free for evaluation.
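As a rough illustration of the API flow described above, the sketch below builds (but does not send) a submission request using only the Python standard library. Everything specific in it is an assumption: the endpoint `https://api.example.com/v1/transcriptions`, the JSON field names, and passing the JWT as a Bearer token are placeholders, not the documented interface. The service's endpoint reference is the authority on the real names.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL from the service's API reference.
API_URL = "https://api.example.com/v1/transcriptions"


def build_transcription_request(
    video_url: str, webhook_url: str, jwt_token: str
) -> urllib.request.Request:
    """Build a POST request asking for a transcript plus SRT and VTT exports."""
    payload = {
        "source_url": video_url,      # hypothetical field name
        "webhook_url": webhook_url,   # hypothetical field name
        "formats": ["srt", "vtt"],    # hypothetical field name
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt_token}",  # assumed JWT transport
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it would be a single `urllib.request.urlopen(request)` call; the results would then arrive later at the webhook URL rather than in the response body.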

5.0 / Sources: Where the video can come from

Paste a URL from anywhere on the web

We resolve the URL, extract the audio, and transcribe it. Public videos only — anything behind a login (private accounts, member-only content, paid streaming services with DRM) won't resolve. If you can play the video in an incognito window without signing in, we can almost always transcribe it.
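For readers who want to try the same kind of URL resolution locally, the following sketch assembles a yt-dlp command line that downloads a public video and keeps only its audio. It assumes yt-dlp and ffmpeg are installed; the function name and the choice of WAV output are illustrative and say nothing about how the service configures its own pipeline.

```python
from typing import List


def build_audio_download_cmd(url: str, output_template: str = "%(id)s.%(ext)s") -> List[str]:
    """Return a yt-dlp command that fetches one public video and extracts its audio."""
    return [
        "yt-dlp",
        "--no-playlist",          # fetch only the single video, not a whole playlist
        "--extract-audio",        # keep the audio track only (requires ffmpeg)
        "--audio-format", "wav",  # illustrative choice of output format
        "--output", output_template,
        url,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Login-protected or DRM-protected URLs fail at this step, which mirrors the public-only rule above.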

File formats: For when you have the file already

Common video file types

All formats below are accepted directly — we extract the audio track on our servers using ffmpeg. Maximum file size depends on plan: 100 MB Free, 500 MB Pro, 2 GB Business.

| No. | Format | Extension | Bitrate range | Notes |
|---|---|---|---|---|
| 01 | MP4 | .mp4 | any | The default for everything: phone recordings, screen captures, downloaded YouTube videos. H.264 / H.265 / AV1; we strip the audio regardless. |
| 02 | MOV | .mov | any | QuickTime container, common for Mac screen recordings and pro cameras. Same audio handling as MP4. |
| 03 | MKV | .mkv | any | Open container popular for high-quality archives. Multiple audio tracks; we use the default track unless you tell us otherwise. |
| 04 | AVI / WMV | .avi / .wmv | any | Older Windows formats. Still common for archival footage. Accepted. |
| 05 | WEBM | .webm | any | Browser-recorded video (Google Meet exports, OBS web recordings). Direct upload, no re-encoding. |
| 06 | Audio-only | .mp3 / .wav / .m4a / .ogg / .opus / .flac | any | If you've already extracted the audio, upload it directly. See the audio-to-text reference for full details. |
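The audio extraction mentioned above is done with ffmpeg. As a hedged sketch of what such a step can look like, the function below assembles an ffmpeg command that drops the video stream and writes a mono audio file. The function name is hypothetical, and the 16 kHz mono target is a common choice for speech models assumed here for illustration, not a documented setting of the service.

```python
from typing import List, Optional


def build_extract_audio_cmd(
    video_path: str,
    audio_path: str,
    audio_track: Optional[int] = None,
    sample_rate_hz: int = 16_000,  # assumption: typical speech-recognition rate
) -> List[str]:
    """Return an ffmpeg command that writes only the audio of `video_path`."""
    cmd = ["ffmpeg", "-i", video_path]
    if audio_track is not None:
        # -map 0:a:N selects the N-th audio stream (zero-based) of the first input,
        # useful for MKV files that carry several audio tracks.
        cmd += ["-map", f"0:a:{audio_track}"]
    # -vn drops video, -ac 1 downmixes to mono, -ar sets the sample rate.
    cmd += ["-vn", "-ac", "1", "-ar", str(sample_rate_hz), audio_path]
    return cmd
```

Running it is a matter of `subprocess.run(build_extract_audio_cmd("talk.mkv", "talk.wav"), check=True)`; without `audio_track`, ffmpeg picks the default audio stream itself.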
Quality: What to expect, honestly

Accuracy that holds up on real video

Video audio quality varies more than people expect: a YouTube tutorial recorded with a USB mic transcribes very differently from a phone-recorded street interview. Here is what to expect.

95%+
On video where the speech is clearly audible — single or two-speaker, decent mic, no overlapping music. This is the headline number, and it's what most users see on a typical YouTube upload, podcast video, or screen recording.
What we deliver
95%+

Clearly recorded video.

USB mic or shotgun mic, one to two speakers, occasional background room tone. The result most podcast producers, YouTubers, and conference organisers see on their typical files.

  • YouTube videos shot with a podcast-style mic setup
  • Zoom and Google Meet recordings (downloaded, then uploaded here)
  • Screen recordings with a separate audio track
  • Conference talks captured by the venue's PA system
What's normal
90%+

Real-world video.

Phone-recorded street interviews, multi-speaker panels, light background music in the intro. Most words are right; a single editorial pass catches the rest.

  • Phone-recorded vlogs and field interviews
  • Documentary B-roll with ambient sound
  • Multi-speaker panel discussions
  • TikTok clips with a music bed
What affects accuracy on video specifically

Music drowning the speech

If a song's vocals are louder than the speaker, accuracy drops fast. We try to skip music-only intros and outros, but mid-clip music beds with overlapping vocals are hard. Solution: lower the music in your edit, or transcribe an alternate audio mix if you have one.

Crosstalk in interview videos

Three or more speakers talking over each other — typical of livestream debates and panel discussions — pushes accuracy below 90%. Diarization may merge two voices it can't separate. Plan a single editorial pass.

Phone audio (8 kHz) inside a video

If a video contains call audio (someone joining a meeting from a phone), that segment will transcribe at phone-call accuracy — typically 80–85%. The rest of the video is unaffected.

What we do NOT do

We don't burn subtitles into the video file. We deliver SRT and VTT files you upload alongside the video (YouTube, Vimeo, your CMS). For burned-in subtitles, take our SRT to a video editor.
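If burned-in subtitles are what you need, ffmpeg's `subtitles` video filter can render an SRT file onto the picture. The sketch below only assembles the command; the function name is hypothetical, the filter requires an ffmpeg build with libass, and paths containing characters that are special in filter syntax (colons, backslashes, quotes) would need escaping that this example does not attempt.

```python
from typing import List


def build_burn_in_cmd(video_path: str, srt_path: str, output_path: str) -> List[str]:
    """Return an ffmpeg command that renders `srt_path` into the video frames."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"subtitles={srt_path}",  # simple paths only; no filter escaping here
        "-c:a", "copy",                  # leave the audio stream untouched
        output_path,
    ]
```

Because the video is re-encoded, this takes noticeably longer than uploading the SRT alongside the original file, and viewers can no longer toggle the captions off.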

Why video creators stay · Note 002 / 2026

I do podcast clips for LinkedIn. Used to spend an hour on each three-minute clip — the SRT, the punctuated post copy, the timestamps. Now I paste the YouTube URL, walk away, and come back to all three. The SRT alignment is the part I expected would be wrong; it isn't.

Reference: Common questions

Frequently asked questions

  1. Does it generate subtitle files?
    Yes — both SRT (for video editors and YouTube uploads) and VTT (for HTML5 players). Word-level alignment so cues match the speech. Optional CPS (characters-per-second) cap of 37 for broadcast-grade pacing; turn it off if you prefer raw timing.
  2. Will it burn the subtitles into my video file?
    No. We return SRT and VTT files separately. To get burned-in subtitles, take our SRT to a video editor (Premiere, Final Cut, DaVinci, or ffmpeg). YouTube and Vimeo accept the SRT as-is — upload it alongside your video and viewers can toggle captions.
  3. What video file formats are supported?
    MP4, MOV, MKV, AVI, WMV, and WEBM. Maximum file size: 100 MB on Free, 500 MB on Pro, 2 GB on Business. Maximum duration: 30 min Free, 60 min Pro, 4 h Business. We extract the audio with ffmpeg server-side, so there's no need to convert before upload.
  4. Can I paste a YouTube or Vimeo link directly?
    Yes. Paste any public URL and we'll resolve the video, extract the audio, and transcribe it. Around 1,500 sites work — YouTube, TikTok, Instagram Reels, Vimeo, Twitter, Facebook, Twitch, Reddit, Dailymotion, BBC iPlayer, and the long tail. Login-required content (private accounts, paid streaming) won't resolve.
  5. How fast is it?
    Roughly 6× realtime on a single chunk. A 30-minute video typically takes 4–6 minutes; a 60-minute video, 9–11 minutes; a 4-hour talk, around 35 minutes. Long files split into chunks and process in parallel.
  6. Does it transcribe foreign-language video?
    Yes. 99 languages with automatic detection — Spanish, French, German, Mandarin, Japanese, Hindi, Arabic, and many more. The 28 languages in our coverage cloud are the ones where we deliver studio-grade or production-grade accuracy. Mixed-language videos (e.g. a Spanish-English interview) typically transcribe well in the dominant language.
  7. Are speaker labels included?
    Yes, on Pro ($19/month) and Business ($49/month) plans. Speaker 1, Speaker 2, etc., with manual rename per speaker. Diarization quality depends on audio: clearly separated voices work best. Three or more overlapping speakers are harder.
  8. Can I get an API for video pipelines?
    Yes. POST a video URL or upload a file and receive the transcript plus SRT and VTT via webhook. JWT auth, per-key rate limits, signed callbacks. Available on every plan including Free for evaluation. See /docs/api for the endpoint reference.
Action: Start trial

Try it on a real video.

60 free minutes per month, no card required. Paste a YouTube URL or upload an MP4 — first transcript and SRT in 5 minutes.

Start free