The seven transcription export formats, and when each one actually wins
We ship seven transcription export formats (TXT, DOCX, MD, PDF, SRT, VTT, JSON), and the right one depends on what you do next. Subtitles need SRT or VTT. Legal review needs PDF. Editing a show-notes draft needs DOCX or MD. Feeding a transcript to an LLM is cleaner in MD. Building anything programmatic needs JSON. All seven are on every plan, including the free 30-minute tier, with no upsell.
The seven formats at a glance
| Format | Best for | Has timestamps? | Has speakers? |
|---|---|---|---|
| TXT | Quick read, paste into anything | No | No |
| DOCX | Editing, tracked changes, sharing | Optional | Yes |
| MD | LLM context, static sites, plain editing | Optional | Yes |
| PDF | Legal review, archive, sign-off | Yes | Yes |
| SRT | Video subtitles (YouTube, Vimeo, most editors) | Yes (comma) | Inline |
| VTT | Web video (<track>), HLS captions | Yes (dot) | Inline |
| JSON | Pipelines, custom UI, search indexing | Word-level | Yes |
Every format comes out of the same job. You don't pick the format at upload — you export whichever ones you need from the transcript page afterwards.
TXT — the lowest-friction option
TXT is plain text. No formatting, no timestamps, no speaker labels. It's what you want when you're going to paste the transcript into an email, a Notion page, or a chat message and don't care about structure.
A 60-minute podcast lands at roughly 8,000–10,000 words of TXT, which is about 50 KB. Open it in anything. The trade-off: you lose speaker turns and timing, so it's a dead end for any workflow that needs to jump back to the audio.
DOCX — for editing and sharing with non-technical people
DOCX is the format to send to a client, an editor, or a lawyer who lives in Microsoft Word. It preserves speaker labels as bold headings and keeps optional inline timestamps. Track Changes works. Comments work.
Use DOCX when the transcript is a working document — an interview that needs cleanup before publication, a deposition draft, a research transcript that a co-author will mark up. If the destination is "someone opens it, edits it, sends it back", DOCX wins.
MD — the format that beats DOCX for LLM context
Markdown is the export to use when the transcript's next stop is an AI model — Claude, GPT, Gemini, an internal RAG system. LLMs read MD more cleanly than DOCX: no XML noise, no style table, just speaker labels as headers and turns as paragraphs.
It's also the right pick for static-site publishing (Hugo, Astro, Jekyll) and for anyone who edits in Obsidian, iA Writer, or VS Code. A podcast show-notes draft in MD goes straight into the post body with no copy-paste cleanup. As LLM input, MD beats DOCX in most cases.
PDF — for legal review and anything that needs a signature
PDF is for the version that gets archived, signed, or filed. Page numbers, fixed layout, embedded fonts. Speaker labels and timestamps render the same on every machine, which matters when someone three years from now opens the file as evidence or as a source citation.
Pick PDF for legal discovery, compliance archives, court submissions, academic dissertation appendices, and any deliverable where the recipient should see exactly what you saw. The PDF includes a header with the source filename and a generation timestamp.
SRT vs VTT — the timestamp gotcha that catches people
Both are subtitle formats. The difference that trips people up is one character in the timestamp.
SRT is the older format. Timestamps use a comma for the decimal:
00:01:23,456 --> 00:01:26,789
VTT (WebVTT) is the web standard. Files start with a WEBVTT header line, and timestamps use a period:
00:01:23.456 --> 00:01:26.789
Use SRT for YouTube, Vimeo, most desktop video editors (Premiere, DaVinci Resolve, Final Cut), and OTT platforms. Use VTT for HTML5 <video> with <track>, HLS adaptive streaming, and anywhere the browser is the player. If you paste a VTT timestamp into a tool expecting SRT (or vice versa), parsers throw errors that look like "invalid timecode" — same content, wrong punctuation.
We don't burn captions into the video file. We deliver the SRT or VTT; you upload it alongside the video on YouTube or Vimeo, or load it as a <track> element. For burned-in subtitles, take the SRT into a video editor.
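If you only have the SRT and need a VTT (say, for a quick test in a browser), the punctuation difference is simple enough to script. The following is a minimal sketch, not our exporter: the function name `srt_to_vtt` and the sample cue are made up for illustration, and it assumes a well-formed UTF-8 SRT with no byte-order mark or styling tags.

```python
import re

# Matches an SRT timestamp such as 00:01:23,456 (hours:minutes:seconds,millis).
SRT_TIMESTAMP = re.compile(r"(\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT subtitle text to WebVTT.

    Swaps the comma in each timestamp for a period and prepends the
    WEBVTT header line that WebVTT files start with.
    """
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only touch timing lines, so commas in dialogue are left alone.
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


# Hypothetical one-cue SRT file used only to exercise the function.
sample_srt = """1
00:01:23,456 --> 00:01:26,789
So, the question was about latency.
"""

print(srt_to_vtt(sample_srt))
```

The numeric cue counters from the SRT are left in place, since WebVTT accepts them as optional cue identifiers. For anything beyond a quick check, exporting the VTT directly from the transcript page is the safer route.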
JSON — what engineers actually want
JSON is the export with all the data the other six throw away. Word-level timestamps. Per-word confidence scores. Speaker labels as speaker_0, speaker_1 (anonymous — your UI can rename them). Language detection result. Segment boundaries. Diarization method used (channel_split for stereo, pyannote for mono).
{
  "language": "en",
  "diarization_method": "channel_split",
  "segments": [
    {
      "speaker": "speaker_0",
      "start": 12.34,
      "end": 18.92,
      "text": "So the question was about latency.",
      "words": [
        {"text": "So", "start": 12.34, "end": 12.51, "confidence": 0.98}
      ]
    }
  ]
}
Use JSON when you're building a search index over a podcast archive, generating clipped audio for social, training a downstream classifier, or rendering a custom transcript UI with click-to-play. The same JSON shape comes back from the REST API, so anything you prototype in the dashboard ports to programmatic use without rework.
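As a rough illustration of the search-index and click-to-play cases, the sketch below walks the JSON shape shown above and maps each word to the places it was spoken. The function name `build_word_index` is hypothetical, the second word entry is invented sample data, and the punctuation stripping assumes that word text may carry trailing punctuation, which you should check against a real export.

```python
import json
from collections import defaultdict

# A small document in the shape shown above; the values are illustrative.
transcript = json.loads("""
{
  "language": "en",
  "diarization_method": "channel_split",
  "segments": [
    {
      "speaker": "speaker_0",
      "start": 12.34,
      "end": 18.92,
      "text": "So the question was about latency.",
      "words": [
        {"text": "So", "start": 12.34, "end": 12.51, "confidence": 0.98},
        {"text": "latency.", "start": 18.40, "end": 18.92, "confidence": 0.95}
      ]
    }
  ]
}
""")


def build_word_index(doc: dict) -> dict:
    """Map each normalized word to the (speaker, start) pairs where it occurs."""
    index = defaultdict(list)
    for segment in doc["segments"]:
        for word in segment["words"]:
            # Lowercase and strip surrounding punctuation so "latency." matches "latency".
            key = word["text"].lower().strip(".,!?;:\"'")
            if key:
                index[key].append((segment["speaker"], word["start"]))
    return dict(index)


index = build_word_index(transcript)
print(index["latency"])  # seek targets for a click-to-play UI
```

Each hit carries a start time in seconds, which is what an audio player needs to seek to the moment the word was spoken.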
Picking the right format for your workflow
A few common cases:
- Podcast publishing: MD for the show notes, SRT for YouTube upload, JSON if you build chapter markers.
- Journalism: DOCX for the editing pass, TXT to drop quotes into the article CMS.
- Legal / compliance: PDF for the file, JSON kept alongside as the machine-readable source.
- Internal AI / RAG: MD into the vector store, JSON if you need word-level alignment.
- Web video on your own site: VTT.
- YouTube, Vimeo, Premiere: SRT.
Export as many as you want — the job runs once, the formats are all generated from the same transcript. On the free 30-minute tier you get all seven, same as Pro and Business.
FAQ
What's the difference between SRT and VTT subtitle files?
Both store subtitle text with start/end timestamps. SRT uses a comma in timestamps (00:01:23,456) and is supported almost everywhere: YouTube, Vimeo, Premiere, DaVinci Resolve, Final Cut. VTT uses a period (00:01:23.456) and is the W3C standard for HTML5 <video> with <track> and for HLS captions. Same content, different punctuation — but most parsers are strict about which one they accept.
Can I export a transcript with word-level timestamps?
Yes. Use the JSON export. Each word has its own start, end, and confidence value, alongside the segment-level speaker label and segment timestamps. The other timestamped formats (SRT, VTT, PDF, and optionally DOCX and MD) carry segment-level timestamps only, and TXT carries none. JSON is also the only export that exposes the diarization method (channel_split for stereo, pyannote for mono) and the language-detection result.
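One thing word-level timing makes possible is short, tightly timed captions for social clips. The sketch below groups word entries into cues of a few words each and renders them as SRT text. It is an illustration under assumptions, not part of the product: `to_srt_timestamp`, `words_to_srt`, and the `max_words` parameter are hypothetical names, and the word values are made up.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list, max_words: int = 3) -> str:
    """Group word entries into short cues and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        # A cue spans from the first word's start to the last word's end.
        start = to_srt_timestamp(chunk[0]["start"])
        end = to_srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Illustrative word entries in the JSON export's shape (values made up).
words = [
    {"text": "So", "start": 12.34, "end": 12.51, "confidence": 0.98},
    {"text": "the", "start": 12.55, "end": 12.70, "confidence": 0.99},
    {"text": "question", "start": 12.74, "end": 13.20, "confidence": 0.97},
    {"text": "was", "start": 13.25, "end": 13.40, "confidence": 0.99},
]

print(words_to_srt(words))
```

Swapping the comma for a period in `to_srt_timestamp` and adding a WEBVTT header line would give the VTT equivalent.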
Is markdown really better than DOCX for sending a transcript to ChatGPT or Claude?
Yes, for most cases. DOCX is XML inside a zip, and when an LLM ingests it, the structure adds tokens without adding meaning. MD is plain text with light syntax — speaker labels become headers, turns become paragraphs — and reads cleanly in the model's context window. Use DOCX when a human will edit the file in Word. Use MD when an LLM will read it.
Do I need to pay extra for certain export formats?
No. All seven formats — TXT, DOCX, MD, PDF, SRT, VTT, JSON — are available on every plan, including the free 30-minute tier. There is no "premium export" upsell. Run the job once, export whichever formats you need, re-export later if the workflow changes.
Will Transcription.Solutions burn subtitles into my video file?
No. We deliver SRT and VTT files that you upload alongside the video — YouTube, Vimeo, your CMS — or attach via an HTML5 <track> element. For burned-in (hard-coded) subtitles where the text is part of the picture, take the SRT into a video editor like Premiere, DaVinci Resolve, or Final Cut and render the export there.
What does a PDF transcript include that a DOCX doesn't?
The PDF is a fixed-layout, paginated version with a header showing the source filename and generation timestamp. Fonts are embedded, so the file looks identical on every machine, every OS, every reader — which matters for legal discovery, compliance archives, and signed sign-off documents. DOCX is editable and flows; PDF is final and fixed. Many teams export both: DOCX for the working copy, PDF for the archive.
Are speaker names in the export the ones I set in the UI?
Yes, in every export that carries speaker labels, with one exception: JSON. The API and JSON output use anonymous labels speaker_0, speaker_1, and so on, because that's what the diarization step returns. Once you rename them in the dashboard (click the chip, pick a name), every subsequent DOCX, MD, PDF, SRT, and VTT export uses the human names. TXT carries no speaker labels at all. Re-export after renaming if you exported before.
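Since the JSON keeps the anonymous labels, a consumer that wants human names has to apply its own mapping. A minimal sketch, assuming the segment shape from the JSON example earlier in this article; `rename_speakers` and the name mapping are hypothetical and not part of the API.

```python
import copy


def rename_speakers(doc: dict, names: dict) -> dict:
    """Return a copy of the transcript with speaker labels replaced.

    Labels missing from `names` are kept as-is.
    """
    renamed = copy.deepcopy(doc)
    for segment in renamed["segments"]:
        segment["speaker"] = names.get(segment["speaker"], segment["speaker"])
    return renamed


# Illustrative transcript with two speakers (values made up).
transcript = {
    "segments": [
        {"speaker": "speaker_0", "start": 12.34, "end": 18.92,
         "text": "So the question was about latency."},
        {"speaker": "speaker_1", "start": 19.10, "end": 21.00, "text": "Right."},
    ]
}

# Hypothetical mapping; in practice it would mirror the names set in the dashboard.
named = rename_speakers(transcript, {"speaker_0": "Host"})
print([s["speaker"] for s in named["segments"]])
```

Working on a deep copy leaves the original export untouched, so the anonymous labels stay available as stable identifiers.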
Related reading
- Audio to text: formats, languages, accuracy — what runs before the export
- Video to text and the SRT/VTT workflow — captioning end-to-end
- The REST API reference — the JSON export shape, served programmatically
- Podcast transcription workflow — MD + SRT in practice
- Interview transcription workflow — when DOCX wins