How to turn one YouTube upload into captions, chapters, and a blog draft
A YouTube transcript for creators is the raw material for four assets you ship anyway: an SRT for the captions track, a VTT for an embedded HTML5 player, a DOCX you edit into a blog post, and a topic-segmented summary you paste into the description as chapters. Paste the video URL, wait 9-11 minutes for a 60-minute video, download all four. The chapter generation works on talk-driven videos; tutorial and demo videos need a manual pass.
What you get back from a single YouTube URL
Paste the link into the YouTube transcription workflow and the job picks up the same file YouTube serves — no download, no re-encode on your end. About 9-11 minutes later for a 60-minute video, the export panel has five formats: DOCX, SRT, VTT, TXT, JSON.
Each one has a job:
- SRT: upload directly to YouTube Studio under Subtitles → Add language → Upload file. Replaces the auto-captions.
- VTT: drop alongside the video on your own site for the HTML5 <track> element. Same timings as SRT, different container.
- DOCX: speaker-labelled, paragraph-broken transcript. The blog draft.
- TXT: flat text, no timestamps. For pasting into a newsletter or notes app.
- JSON: word-level timestamps and speaker turns. For anyone scripting custom workflows over the REST API.
Five formats, one job, same on every plan including the free 30 minutes.
Captions: SRT for YouTube, VTT for your own player
YouTube's auto-captions are fine for English monologue at studio bitrate. They drift on accents, on overlapping speech, and on technical vocabulary. The uploaded SRT replaces them and shows up as the default "English" track instead of "English (auto-generated)" — viewers notice the difference.
The SRT we generate is line-broken at ~42 characters with 2-line blocks, which is the YouTube-friendly default. If you want a single track for video files served from your own CMS, use the VTT — it's the same timings wrapped in WebVTT instead of SubRip. Both formats survive being re-uploaded after a minor edit; you don't have to regenerate from the audio.
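If you hand-edit the SRT and want a matching VTT without going back to the export panel, the conversion is mechanical. The sketch below is an illustration, not the exporter's own code: it assumes a well-formed SRT, and the function name `srt_to_vtt` is made up for this example. It adds the WEBVTT header and swaps the comma before the milliseconds for a dot; the timings themselves are untouched.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SubRip text to WebVTT without changing any timings."""
    # Normalise line endings and drop a UTF-8 byte-order mark if present.
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    # WebVTT uses a dot, not a comma, before the milliseconds.
    body = _TIMING.sub(r"\1.\2 --> \3.\4", body)
    # A WebVTT file starts with the WEBVTT header followed by a blank line.
    return "WEBVTT\n\n" + body + "\n"
```

The numeric cue indices from the SRT are left in place, since WebVTT accepts them as cue identifiers.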
On a 128 kbps or better YouTube source, expect ~92% word accuracy on real-world podcast-style audio. Plan to skim the SRT for proper nouns and brand names before publishing — that's where the last 8% lives.
Chapters: honest about what the topic summary can and cannot do
The summary that comes with each job is topic-segmented — a list of chapter-style headings with the timestamp where each topic starts. On a talk-driven video (interview, podcast feed, monologue, panel), you can paste it into the YouTube description with minimal cleanup and the chapters appear under the scrubber.
It works because the model is looking at what is being said, and on talk-driven video that's a faithful proxy for what the video is about.
It works less well when:
- The video is a screen recording or tutorial. The transitions that matter ("now open the settings panel") are visual. The transcript says "and now we click here" three times and the chapter boundaries don't align with the actual UI sections.
- The video is a product demo with long silent stretches. The summary will compress the silent parts and over-weight the spoken intro.
- The host narrates over B-roll. The chapter boundaries follow the narration, not the visual cuts.
For those cases, the topic summary is a starting point — you'll move 2-4 timestamps by hand. For interviews, podcast video, and talking-head explainers, it's usually publishable as-is. We don't analyse on-screen content; the chapters come from the transcript topics only.
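Once the timestamps are where you want them, the description block has to follow YouTube's chapter rules: the first timestamp is 0:00, there are at least three timestamps in ascending order, and each chapter runs at least 10 seconds. Here is a minimal sketch that formats and checks a chapter list. The input shape, a list of `(start_seconds, title)` pairs, is an assumption for illustration and not the format of the summary export.

```python
def format_timestamp(seconds: int) -> str:
    """Render seconds as M:SS, or H:MM:SS once the video passes an hour."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def chapters_for_description(topics):
    """Turn (start_seconds, title) pairs into description-ready chapter lines."""
    ordered = sorted(topics)
    if len(ordered) < 3:
        raise ValueError("YouTube needs at least three chapters")
    if ordered[0][0] != 0:
        raise ValueError("the first chapter must start at 0:00")
    # Each chapter must last at least 10 seconds before the next one starts.
    for (start, _), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - start < 10:
            raise ValueError(f"chapter at {format_timestamp(start)} is too short")
    return "\n".join(f"{format_timestamp(s)} {title}" for s, title in ordered)
```

Calling it with `[(0, "Intro"), (95, "Why captions matter"), (3725, "Q&A")]` produces three lines starting `0:00`, `1:35`, and `1:02:05`.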
Blog draft: the DOCX, and what to do with it
The DOCX export is a speaker-labelled, paragraph-broken document. Not a finished blog post. The work that's already done: paragraph breaks at natural pauses, speaker labels (renameable), punctuation, capitalisation, filler-word reduction.
What you do in 20-40 minutes of editing:
- Pick a thesis. The video probably has three. Pick one.
- Cut the chronological structure. Blog readers don't watch in order.
- Pull 2-3 direct quotes out as pull-quotes. The rest you paraphrase.
- Write a new intro that isn't "Hey everyone, welcome back." The video opening doesn't translate.
- Add the YouTube embed at the top and keep the post as the "extended notes" version.
A solo creator doing weekly long-form video typically gets a 1,200-word post out of a 45-minute interview this way. The DOCX gives you the structure; you cut roughly 70% and rewrite the rest.
Workflow for a typical weekly upload
For a creator publishing one 30-60 minute video per week, the numbers work out like this:
| Step | Time |
|---|---|
| Paste YouTube URL, kick off job | 30 seconds |
| Job runs (60-min video) | 9-11 minutes |
| Skim SRT, fix proper nouns, upload to YouTube Studio | 5-10 minutes |
| Paste topic summary into description as chapters | 2-5 minutes |
| Edit DOCX into blog post | 20-40 minutes |
About 30-55 minutes of human work per upload, versus the 2-3 hours most creators spend doing the same four things manually. The Pro plan at $19/month covers 600 minutes: enough for ten 60-minute videos a month, or a weekly 60-minute upload with room to re-run a botched job.
Source audio is permanently deleted from infrastructure within 24 hours of job completion. Transcripts stay in your account until you delete them. We do not train models on your data.
FAQ
Can I upload an SRT file generated here directly to YouTube Studio?
Yes. The SRT export uses YouTube-compatible line breaks (≤42 characters, max 2 lines per cue) and standard SubRip timing. Upload it under Subtitles → Add language → English → Upload file → With timing. It replaces the auto-generated captions and shows in the player as a publisher-uploaded track. You can edit the SRT in any text editor before uploading — the format is plain text.
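If you do edit the SRT by hand, it is easy to push a line past the limits without noticing. The sketch below re-checks an edited file against the ≤42-character, 2-line convention described above. It is an illustrative checker with a made-up name (`lint_srt`), not part of the product, and it assumes cues are separated by blank lines.

```python
import re


def lint_srt(srt_text: str, max_chars: int = 42, max_lines: int = 2):
    """Return (cue_id, problem) pairs for cues that break the line limits."""
    problems = []
    # Cues are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        # A cue is: index line, timing line, then one or more text lines.
        if len(lines) < 3 or "-->" not in lines[1]:
            problems.append((lines[0], "malformed cue"))
            continue
        cue_id, text_lines = lines[0], lines[2:]
        if len(text_lines) > max_lines:
            problems.append((cue_id, f"{len(text_lines)} text lines"))
        for line in text_lines:
            if len(line) > max_chars:
                problems.append((cue_id, f"line is {len(line)} characters"))
    return problems
```

An empty list means every cue is within the limits; anything else tells you which cue number to reopen.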
Do the chapters actually work for tutorial and screen-recording videos?
Less reliably than for talk-driven videos. The topic summary is generated from the transcript only — we don't analyse on-screen content. For tutorials where the meaningful transitions are visual ("now open the dashboard"), the chapter boundaries from the transcript often miss by 30-60 seconds. Treat the summary as a draft and adjust 2-4 timestamps manually. For interviews, podcasts, monologue, and panel video, it's usually publishable without editing.
How accurate is the transcript on YouTube audio?
On YouTube uploads with source quality at 128 kbps or higher — which covers most modern uploads — expect ~92% word accuracy on real-world podcast-style audio. Studio-grade material edges toward 95%. Older uploads at lower bitrates, heavy accents, multiple overlapping speakers, or background music drop accuracy to 85-88%. The remaining errors cluster around proper nouns, brand names, and technical vocabulary — the parts worth a 5-minute skim before publishing.
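To sanity-check these figures on your own channel, correct a few minutes of transcript by hand and compare it to the raw output. The sketch below assumes "word accuracy" means one minus the word error rate (substitutions, deletions, and insertions divided by the number of reference words), which is the common convention; how the figures above were measured is not stated here. The function name is illustrative.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, using word-level edit distance against the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return 1 - prev[-1] / len(ref)
```

One wrong word in a ten-word reference gives 0.9. Punctuation is not stripped in this sketch, so normalise both texts the same way before comparing.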
What about videos with multiple speakers — interviews, panels?
Speaker diarization is on by default for interview workflows. Labels come back as Speaker 1, Speaker 2, Speaker 3, which you rename once in the dashboard — the rename propagates to the DOCX export. For stereo files with one speaker per channel, we use channel split (100% accurate). For mono mixed audio — which is what YouTube serves — we run pyannote diarization, which is reliable for 2-4 distinct speakers but degrades on heavy overlap.
Can I send the YouTube URL via API instead of pasting it?
Yes. The REST API accepts a URL field with any of 1,500+ supported sources, including YouTube. You POST the URL, get a job ID, and receive a signed webhook callback when the transcript is ready. The response includes the same five export formats. JWT auth, available on every plan including free. This is how creators with a publishing pipeline wire transcription into their upload flow.
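A rough sketch of what the POST might look like from Python's standard library. Everything specific here is a placeholder: the endpoint `https://api.example.com/v1/jobs`, the JSON field names `url` and `webhook_url`, and passing the JWT as a Bearer token are assumptions for illustration, so check the API reference for the real values.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real one from the API reference.
API_URL = "https://api.example.com/v1/jobs"


def build_job_request(video_url: str, jwt_token: str,
                      callback_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a video URL."""
    # Field names are assumed for this sketch.
    payload = {"url": video_url, "webhook_url": callback_url}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt_token}",  # assumed JWT placement
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is one call to `urllib.request.urlopen(request)`. The job ID comes back in the response and the transcript arrives later via the webhook; the exact response shape is not covered here.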
What's the per-video cost on a weekly creator schedule?
On the $19/month Pro plan, 600 minutes covers ten 60-minute videos. Overage is billed at $0.04/minute on Pro and $0.02/minute on Business. A creator publishing one 45-minute video per week uses ~180 minutes/month, well inside quota. If you're publishing daily short-form, the Business plan at $49/month for 2,500 minutes is the next step. The free tier (30 minutes/month, no card) is enough to transcribe one short upload as an evaluation.
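The arithmetic is simple enough to script if you want to plug in your own schedule. This sketch uses only the prices and quotas quoted in this answer; the function and dictionary names are illustrative, and it assumes overage is billed per minute beyond the included quota.

```python
# Plan figures as quoted in this article.
PLANS = {
    "pro": {"monthly_usd": 19.0, "included_minutes": 600,
            "overage_usd_per_min": 0.04},
    "business": {"monthly_usd": 49.0, "included_minutes": 2500,
                 "overage_usd_per_min": 0.02},
}


def monthly_cost(plan: str, minutes_used: int) -> float:
    """Base price plus per-minute overage beyond the included quota."""
    p = PLANS[plan]
    extra_minutes = max(0, minutes_used - p["included_minutes"])
    return round(p["monthly_usd"] + extra_minutes * p["overage_usd_per_min"], 2)


def cost_per_video(plan: str, videos_per_month: int,
                   minutes_per_video: int) -> float:
    """Average cost per video for a fixed monthly schedule."""
    total = monthly_cost(plan, videos_per_month * minutes_per_video)
    return round(total / videos_per_month, 2)
```

Four 45-minute videos a month on Pro works out to $4.75 per video. Under these assumptions, Pro with overage matches the Business price at 1,350 minutes a month, so Business only pays off above that.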
Does this work for languages other than English?
Yes — 99 languages, auto-detected from the first 30 seconds of audio, one price across all of them. You can also manually override the detected language for mixed-language or accented content. Accuracy varies by language (high-resource languages cluster around the ~92% figure; lower-resource languages have more variance), but the workflow — paste URL, get five exports — is identical regardless of language.
Related reading
- How to transcribe a YouTube video without downloading it first — the URL-paste workflow in detail
- Podcast transcription: show notes in under 15 minutes — the same DOCX-to-blog workflow for audio-only feeds
- Interview transcription with speaker labels — diarization, renaming, multi-speaker edits
- Audio-to-text: formats, languages, accuracy — the technical reference for what the engine handles