Question 1

Which JSON shapes does this support?

Accepted Answer

Four providers: OpenAI Whisper verbose_json (top-level segments[]), AssemblyAI (utterances[] preferred, words[] fallback), Deepgram (results.channels[].alternatives[].words[] or paragraphs), and AWS Transcribe (results.items[] with start_time/end_time + alternatives). Provider is auto-detected from JSON shape; you can override with the dropdown.
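
Shape-based detection along these lines can be sketched in a few lines of TypeScript. This is an illustrative sketch, not the tool's actual code: the `Provider` type and the `detectProvider` and `isObject` names are hypothetical, and the order of the checks is an assumption.

```typescript
type Provider = "whisper" | "assemblyai" | "deepgram" | "aws";

function isObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: guess the provider from the keys named in the answer above.
function detectProvider(json: unknown): Provider | null {
  if (!isObject(json)) return null;

  // Deepgram and AWS both nest their payload under "results".
  const results = json["results"];
  if (isObject(results)) {
    if (Array.isArray(results["channels"])) return "deepgram";
    if (Array.isArray(results["items"])) return "aws";
  }

  // Check segments[] before words[] so a Whisper response that also
  // carries word timestamps is not mistaken for AssemblyAI.
  if (Array.isArray(json["segments"])) return "whisper";
  if (Array.isArray(json["utterances"]) || Array.isArray(json["words"])) {
    return "assemblyai";
  }
  return null;
}
```

A `null` result is the case where a manual override, such as the dropdown, would be needed.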

Question 2

Is anything uploaded to your servers?

Accepted Answer

No. The parser and serializer run entirely in your browser (TypeScript compiled to JS, no network calls). You can paste a sensitive transcript and it never leaves your machine.

Question 3

Why not just use Whisper or Deepgram directly?

Accepted Answer

You can — and many devs do. We add value when you need managed diarization, multi-format export (SRT/VTT/TXT/DOCX/JSON/PDF) out of one call, meeting bots for Zoom/Meet/Teams, LLM post-processing (summaries, SOAP notes, action items), and a dashboard so non-dev teammates can use the same pipeline. If all you need is a one-shot Whisper call, your stack is fine.

Question 4

How are word-level outputs (Deepgram / AssemblyAI words / AWS) chunked into cues?

Accepted Answer

Greedy chunker: max 6 seconds per cue, max ~84 chars (2 lines × 42), and a forced break on speaker change or sentence-final punctuation (. ! ?). This is a reasonable default for readable subtitles. If you need different chunking, the parsing library is exported at /lib/asr-json-parsers.ts and lives in this repo.
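
A greedy chunker with these limits might look like the sketch below. It is a minimal illustration under stated assumptions, not the code in /lib/asr-json-parsers.ts: the `Word` and `Cue` shapes, the constant names, and `chunkWords` are all hypothetical, and times are assumed to be in seconds.

```typescript
interface Word { text: string; start: number; end: number; speaker: string | null; }
interface Cue { text: string; start: number; end: number; speaker: string | null; }

const MAX_CUE_SECONDS = 6;     // maximum cue duration
const MAX_CUE_CHARS = 84;      // 2 lines x 42 characters
const SENTENCE_END = /[.!?]$/; // sentence-final punctuation

// Hypothetical helper: walk the words once, closing the open cue whenever
// adding the next word would break a limit or a forced break applies.
function chunkWords(words: Word[]): Cue[] {
  const cues: Cue[] = [];
  let current: Cue | null = null;

  for (const word of words) {
    if (current !== null) {
      const speakerChanged = word.speaker !== current.speaker;
      const tooLong = word.end - current.start > MAX_CUE_SECONDS;
      const tooWide = current.text.length + 1 + word.text.length > MAX_CUE_CHARS;
      const sentenceEnded = SENTENCE_END.test(current.text);
      if (speakerChanged || tooLong || tooWide || sentenceEnded) {
        cues.push(current);
        current = null;
      }
    }
    if (current === null) {
      // Start a new cue with this word.
      current = { text: word.text, start: word.start, end: word.end, speaker: word.speaker };
    } else {
      // Extend the open cue.
      current.text += " " + word.text;
      current.end = word.end;
    }
  }
  if (current !== null) cues.push(current);
  return cues;
}
```

Because the check runs before a word is appended, a single word that is itself longer than a limit still becomes its own cue rather than being dropped.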

Question 5

What's the difference between SRT and VTT?

Accepted Answer

Both are time-aligned text formats. SRT uses a comma as the decimal separator (00:00:01,500) and is the de facto standard for video files and most editors. WebVTT uses a period (00:00:01.500) and supports the HTML5 <track> element, CSS cue styling, and inline speaker tags via <v>. Pick SRT for editors and uploads (YouTube, Premiere, DaVinci), VTT for the browser.
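
The two timecode styles differ only in the millisecond separator, so one formatter can serve both. The sketch below assumes input times in seconds; `formatTimestamp` is a hypothetical name, not part of the tool's published API.

```typescript
// Hypothetical helper: format a time in seconds as HH:MM:SS,mmm (SRT)
// or HH:MM:SS.mmm (WebVTT).
function formatTimestamp(seconds: number, format: "srt" | "vtt"): string {
  // Work in whole milliseconds to avoid floating-point drift.
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);

  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const separator = format === "srt" ? "," : ".";
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${separator}${pad(ms, 3)}`;
}

console.log(formatTimestamp(1.5, "srt")); // 00:00:01,500
console.log(formatTimestamp(1.5, "vtt")); // 00:00:01.500
```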

Question 6

Does this preserve speaker labels?

Accepted Answer

Yes, when the source has them. The AssemblyAI 'speaker' field, the Deepgram 'speaker' integer (per word/paragraph), and the AWS 'speaker_label' all get mapped to a 'Speaker {id}' prefix in cues. OpenAI Whisper has no native diarization, so no speaker labels there.
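
The field mapping can be illustrated roughly as follows. This is a sketch under assumptions: `speakerLabel` is a hypothetical helper, and the raw provider value is used as the id unchanged. Whether the tool itself shortens an AWS value such as 'spk_0' is not stated here.

```typescript
type Provider = "whisper" | "assemblyai" | "deepgram" | "aws";

// Hypothetical helper: read the provider's speaker field from one
// word/utterance/item object and turn it into a "Speaker {id}" label.
function speakerLabel(provider: Provider, entry: Record<string, unknown>): string | null {
  // Whisper output carries no diarization, so there is nothing to map.
  if (provider === "whisper") return null;

  // AWS uses "speaker_label"; AssemblyAI and Deepgram use "speaker".
  const field = provider === "aws" ? "speaker_label" : "speaker";
  const raw = entry[field];

  // Accept string ids (AssemblyAI, AWS) and integer ids (Deepgram).
  if (typeof raw !== "string" && typeof raw !== "number") return null;
  if (raw === "") return null;
  return `Speaker ${raw}`;
}
```

Returning `null` when the field is absent lets the serializer omit the prefix instead of printing an empty label.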

Question 7

How much can I paste? Is there a size limit?

Accepted Answer

Bounded by your browser's memory. A multi-hour transcription JSON (megabytes) parses fine on modern hardware. If your file is so large the textarea lags, drag-drop the .json file directly — it bypasses the textarea path.

Question 8

Is the output spec-compliant?

Accepted Answer

SRT output: zero-padded HH:MM:SS,mmm timecodes, blank line between cues, trailing newline. VTT output: 'WEBVTT' header, HH:MM:SS.mmm timecodes, <v Speaker> tags when speakers are present. The SRT output validates as standard SubRip and the VTT output as standard WebVTT.
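
Serializers producing output of this shape could look like the sketch below. It is illustrative only: the `Cue` shape, `timecode`, `toSrt`, and `toVtt` are hypothetical names, the 'Speaker: ' separator in SRT is an assumption, and the numeric cue counter follows the usual SubRip layout.

```typescript
interface Cue { start: number; end: number; text: string; speaker: string | null; }

// Zero-padded HH:MM:SS<sep>mmm from a time in seconds.
function timecode(seconds: number, sep: "," | "."): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const pad = (n: number, width: number): string => String(n).padStart(width, "0");
  const h = Math.floor(totalMs / 3_600_000);
  const m = Math.floor(totalMs / 60_000) % 60;
  const s = Math.floor(totalMs / 1000) % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(totalMs % 1000, 3)}`;
}

// SubRip: counter, timecode line, text, then a blank line between cues.
function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, i) => {
      const prefix = cue.speaker ? `${cue.speaker}: ` : "";
      return `${i + 1}\n${timecode(cue.start, ",")} --> ${timecode(cue.end, ",")}\n${prefix}${cue.text}\n`;
    })
    .join("\n");
}

// WebVTT cue text must not contain raw "&" or "<".
const escapeVtt = (text: string): string =>
  text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

// WebVTT: "WEBVTT" header, blank line, then cues with optional <v> voice tags.
function toVtt(cues: Cue[]): string {
  const blocks = cues.map((cue) => {
    const text = escapeVtt(cue.text);
    const body = cue.speaker ? `<v ${cue.speaker}>${text}` : text;
    return `${timecode(cue.start, ".")} --> ${timecode(cue.end, ".")}\n${body}\n`;
  });
  return ["WEBVTT\n", ...blocks].join("\n");
}
```

Each block ends with its own newline and blocks are joined with another, which yields the blank line between cues and the trailing newline described above. The sketch does not sanitize speaker names, so a name containing '>' or '&' would need extra handling inside the <v> tag.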

Convert Whisper / AssemblyAI / Deepgram JSON to SRT or VTT — Free

Tired of stitching ASR pipelines together? Our API does transcription + diarization + SRT/VTT export in one call.

Paste in. Subtitles out. No server round-trip.

Paste JSON or drop a file

Auto-detect provider

Pick SRT or VTT

Download or copy

Four providers, one parser. Auto-detected.

OpenAI Whisper

AssemblyAI

Deepgram

AWS Transcribe

8 questions developers ask about this tool.

Stop hand-rolling SRT encoders. Ship the product.