Free developer tool · 100% client-side

Convert Whisper / AssemblyAI / Deepgram JSON to SRT or VTT — Free

No more pipeline glue. Paste your transcription JSON, get standard subtitle files. Runs entirely in your browser — nothing uploads. Auto-detects four providers; override if it guesses wrong.

Input JSON

SubRip (.srt) output

Output appears here. Paste JSON on the left to convert.

For the devs who built this pipeline

Tired of stitching ASR pipelines together? Our API does transcription + diarization + SRT/VTT export in one call.

$0.0025/min with diarization included, 100+ languages, REST + webhooks, JWT auth. No quotas to babysit across providers, no SRT-encoder dependencies, no "wait, was that timestamp in seconds or milliseconds?" bugs.

See API docs → · Pricing

How it works

Paste in. Subtitles out. No server round-trip.

  1. Paste JSON or drop a file

    Run your usual transcribe call against Whisper, AssemblyAI, Deepgram, or AWS. Paste the response or drop the .json file on the textarea.

  2. Auto-detect provider

    We inspect the JSON shape — segments[], utterances[], results.channels[], results.items[] — and pick the matching parser. Override the guess if your payload is unusual.

  3. Pick SRT or VTT

    Toggle the output format. Speaker labels (when present in source) are mapped to 'Speaker A/B/...' in SRT or <v Speaker> tags in VTT.

  4. Download or copy

    Get a spec-compliant .srt or .vtt file ready for YouTube, Premiere, DaVinci, HTML5 <track>, or anywhere else subtitles go.
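Steps 3 and 4 boil down to timecode formatting plus a little string assembly. Here is a minimal sketch of what such a serializer could look like. The `Cue` shape, the function names, and the `Speaker A: ` prefix (with a colon) in SRT are assumptions made for illustration, not necessarily what this tool ships.

```typescript
// Hypothetical normalized cue; times are in milliseconds.
interface Cue {
  startMs: number;
  endMs: number;
  text: string;
  speaker?: string; // e.g. "A"; undefined when the source has no diarization
}

// Format milliseconds as zero-padded HH:MM:SS<sep>mmm ("," for SRT, "." for VTT).
function formatTimecode(ms: number, sep: "," | "."): string {
  const total = Math.max(0, Math.round(ms));
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const h = Math.floor(total / 3600000);
  const m = Math.floor((total % 3600000) / 60000);
  const s = Math.floor((total % 60000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(total % 1000, 3)}`;
}

// SRT: numbered cues, blank line between cues, trailing newline.
function toSrt(cues: Cue[]): string {
  return cues
    .map((c, i) => {
      const label = c.speaker ? `Speaker ${c.speaker}: ` : "";
      const times = `${formatTimecode(c.startMs, ",")} --> ${formatTimecode(c.endMs, ",")}`;
      return `${i + 1}\n${times}\n${label}${c.text}\n`;
    })
    .join("\n");
}

// WebVTT cue text must not contain raw "&", "<" or ">".
function escapeVtt(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// VTT: "WEBVTT" header, period separators, <v Speaker> voice tags.
function toVtt(cues: Cue[]): string {
  const blocks = cues.map((c) => {
    const text = escapeVtt(c.text);
    const line = c.speaker ? `<v Speaker ${c.speaker}>${text}` : text;
    const times = `${formatTimecode(c.startMs, ".")} --> ${formatTimecode(c.endMs, ".")}`;
    return `${times}\n${line}\n`;
  });
  return ["WEBVTT\n", ...blocks].join("\n");
}
```

Keeping every time in integer milliseconds until the last step means the only difference between the two formats' timecodes is the separator character.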

Supported JSON shapes

Four providers, one parser. Auto-detected.

OpenAI Whisper

response_format=verbose_json

Shape: top-level segments[] with start/end (seconds)

No native diarization — speaker labels not produced.

AssemblyAI

transcript JSON

Shape: utterances[] (preferred) or words[], ms timestamps

Speaker labels (A/B/...) preserved when speaker_labels=true.

Deepgram

Nova / Whisper-hosted

Shape: results.channels[].alternatives[] — paragraphs or words

punctuated_word used when present. Integer speaker mapped to label.

AWS Transcribe

JSON output

Shape: results.items[] with start_time (string) + alternatives

Punctuation items attached to preceding word. speaker_label respected.
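Shape-based detection over the four payloads above might be sketched like this. The function names, the `Provider` union, and the check order are assumptions for illustration; the real parser may use different or stricter heuristics. The timestamp helper also assumes Deepgram reports seconds, which the list above does not state.

```typescript
type Provider = "whisper" | "assemblyai" | "deepgram" | "aws";

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Guess the provider from the top-level JSON shape; null means "ask the user".
function detectProvider(json: unknown): Provider | null {
  if (!isRecord(json)) return null;
  const results = json["results"];
  if (isRecord(results)) {
    if (Array.isArray(results["channels"])) return "deepgram"; // results.channels[]
    if (Array.isArray(results["items"])) return "aws"; // results.items[]
  }
  // Check segments[] before words[]: a Whisper payload can carry both.
  if (Array.isArray(json["segments"])) return "whisper";
  if (Array.isArray(json["utterances"]) || Array.isArray(json["words"])) {
    return "assemblyai";
  }
  return null;
}

// Normalize a provider timestamp to integer milliseconds.
// AssemblyAI already reports ms; the others report seconds (AWS as a string).
function toMs(provider: Provider, t: number | string): number {
  const n = typeof t === "string" ? Number.parseFloat(t) : t;
  return provider === "assemblyai" ? Math.round(n) : Math.round(n * 1000);
}
```

Returning `null` instead of throwing leaves room for the manual override when a payload matches none of the known shapes.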

Common questions

8 questions developers ask about this tool.

01. Which JSON shapes does this support?
Four providers: OpenAI Whisper verbose_json (top-level segments[]), AssemblyAI (utterances[] preferred, words[] fallback), Deepgram (results.channels[].alternatives[].words[] or paragraphs), and AWS Transcribe (results.items[] with start_time/end_time + alternatives). Provider is auto-detected from JSON shape; you can override with the dropdown.
02. Is anything uploaded to your servers?
No. The parser and serializer run entirely in your browser (TypeScript compiled to JS, no network calls), so even a sensitive transcript never leaves your machine.
03. Why not just use Whisper or Deepgram directly?
You can — and many devs do. We add value when you need managed diarization, multi-format export (SRT/VTT/TXT/DOCX/JSON/PDF) out of one call, meeting bots for Zoom/Meet/Teams, LLM post-processing (summaries, SOAP notes, action items), and a dashboard so non-dev teammates can use the same pipeline. If all you need is a one-shot Whisper call, your stack is fine.
04. How are word-level outputs (Deepgram / AssemblyAI words / AWS) chunked into cues?
Greedy chunker: max 6 seconds per cue, max ~84 chars (2 lines × 42), and a forced break on speaker change or sentence-final punctuation (. ! ?). This is a reasonable default for readable subtitles. If you need different chunking, the parsing library is exported at /lib/asr-json-parsers.ts and lives in this repo.
05. What's the difference between SRT and VTT?
Both are time-aligned text formats. SRT uses comma decimal separators (00:00:01,500) and is the de-facto standard for video files and most editors. WebVTT uses period separators (00:00:01.500), supports HTML5 <track>, CSS cue styling, and inline speaker tags via <v Speaker>. Pick SRT for editors and uploads (YouTube, Premiere, DaVinci), VTT for browser <video> playback.
06. Does this preserve speaker labels?
Yes, when the source has them. The AssemblyAI 'speaker' field, the Deepgram 'speaker' integer (per word/paragraph), and the AWS 'speaker_label' all get mapped to a 'Speaker {id}' prefix in cues. OpenAI Whisper has no native diarization, so no speaker labels there.
07. How much can I paste? Is there a size limit?
Bounded by your browser's memory. A multi-hour transcription JSON (megabytes) parses fine on modern hardware. If your file is so large the textarea lags, drag-drop the .json file directly — it bypasses the textarea path.
08. Is the output spec-compliant?
SRT output: zero-padded HH:MM:SS,mmm timecodes, blank line between cues, trailing newline. VTT output: 'WEBVTT' header, HH:MM:SS.mmm timecodes, <v Speaker> tags when speakers are present. Both validate as standard SubRip and WebVTT.
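
The greedy chunking described in question 04 could look roughly like this. It is a sketch under stated assumptions: the `Word` and `Cue` shapes and the `chunkWords` name are hypothetical, and this is not the code in /lib/asr-json-parsers.ts. The limits (6 seconds, 84 characters) and the break rules come from the answer above.

```typescript
// Hypothetical normalized word and cue shapes; times are in milliseconds.
interface Word {
  text: string;
  startMs: number;
  endMs: number;
  speaker?: string;
}

interface Cue {
  startMs: number;
  endMs: number;
  text: string;
  speaker?: string;
}

const MAX_CUE_MS = 6000; // max 6 seconds per cue
const MAX_CUE_CHARS = 84; // 2 lines x 42 characters

function chunkWords(words: Word[]): Cue[] {
  const cues: Cue[] = [];
  let current: Word[] = [];

  // Close the cue being built, if any, and start a fresh one.
  const flush = (): void => {
    if (current.length === 0) return;
    cues.push({
      startMs: current[0].startMs,
      endMs: current[current.length - 1].endMs,
      text: current.map((w) => w.text).join(" "),
      speaker: current[0].speaker,
    });
    current = [];
  };

  for (const word of words) {
    if (current.length > 0) {
      const first = current[0];
      // Length of the space-joined text if this word were appended.
      const charsIfAdded =
        current.reduce((n, w) => n + w.text.length + 1, 0) + word.text.length;
      const tooLong = word.endMs - first.startMs > MAX_CUE_MS;
      const tooWide = charsIfAdded > MAX_CUE_CHARS;
      const speakerChanged = word.speaker !== first.speaker;
      if (tooLong || tooWide || speakerChanged) flush();
    }
    current.push(word);
    // Forced break after sentence-final punctuation.
    if (/[.!?]$/.test(word.text)) flush();
  }
  flush();
  return cues;
}
```

The punctuation test is deliberately naive: it also breaks after abbreviations such as "Dr." and misses a closing quote after the period, so a production chunker would likely want a smarter sentence check.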

Stop hand-rolling SRT encoders. Ship the product.

We handle transcription, diarization, subtitle export, meeting bots, LLM post-processing, and a dashboard — all behind one REST API at $0.0025/min. Your future self thanks you.

See API docs →

Just need the tool? Bookmark this page — it works offline once loaded.