The transcription API, end to end, on one screen
A working integration with our transcription API is four HTTP calls: POST /auth/token for a JWT, POST /jobs with the audio file or URL, then either GET /jobs/{id} polling or a signed webhook to your endpoint, and finally GET /jobs/{id}/transcript?format=json to pull the result. A 60-minute file completes in 9–11 minutes — approximately 6× faster than realtime. Below: the exact request shapes, the failure modes that actually happen, and what your code should do about each one.
Auth: one JWT, then forget about it
Auth is a standard bearer JWT. You exchange an API key (issued in the dashboard) for a short-lived access token and a longer-lived refresh token.
POST /auth/token
Content-Type: application/json
{"api_key": "tsk_live_..."}
The response gives you access_token (valid 1 hour) and refresh_token (valid 30 days). Send Authorization: Bearer <access_token> on every subsequent call. When you get a 401 token_expired, hit POST /auth/refresh with the refresh token rather than re-exchanging the API key — it's faster and avoids rate limits on the key endpoint.
The same JWT works on the REST API regardless of plan. Free, Pro, and Business all share the same surface — only the quotas differ.
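One way to "forget about" the token is to wrap it in a small cache that refreshes shortly before the one-hour expiry. This is a minimal sketch under assumptions: `post` is a hypothetical transport callable you supply, and the request body for /auth/refresh (`{"refresh_token": ...}`) and the shape of its response are guesses, since the text above does not show them.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical transport: post(path, json_body) -> parsed JSON response dict.
Post = Callable[[str, Dict[str, str]], Dict[str, str]]


class TokenManager:
    """Caches the access token and refreshes it shortly before the 1-hour expiry."""

    ACCESS_TTL_SECONDS = 3600   # "valid 1 hour" per the text above
    SAFETY_MARGIN_SECONDS = 60  # refresh a little early to avoid a 401 mid-request

    def __init__(self, api_key: str, post: Post, clock: Callable[[], float] = time.time):
        self._api_key = api_key
        self._post = post
        self._clock = clock
        self._access: Optional[str] = None
        self._refresh: Optional[str] = None
        self._expires_at = 0.0

    def _store(self, response: Dict[str, str]) -> None:
        self._access = response["access_token"]
        # Assumption: a refresh response may omit refresh_token; keep the old one then.
        self._refresh = response.get("refresh_token", self._refresh)
        self._expires_at = self._clock() + self.ACCESS_TTL_SECONDS

    def bearer(self) -> str:
        """Return an Authorization header value, refreshing if needed."""
        if self._access is None:
            self._store(self._post("/auth/token", {"api_key": self._api_key}))
        elif self._clock() >= self._expires_at - self.SAFETY_MARGIN_SECONDS:
            # Assumed request body shape for the refresh endpoint.
            self._store(self._post("/auth/refresh", {"refresh_token": self._refresh}))
        return f"Bearer {self._access}"
```

Call `bearer()` before every request and pass the result as the Authorization header. You would still want to handle a 401 token_expired as a fallback, because clocks drift.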
POST the file (or the URL — same endpoint)
There are two ways to submit a job, and they hit the same endpoint.
Direct file upload — multipart form, for files already on your machine:
POST /jobs
Authorization: Bearer <jwt>
Content-Type: multipart/form-data
file=@interview.mp3
diarize=true
language=auto
webhook_url=https://yourapp.com/hooks/transcription
URL ingestion — JSON body, for files behind a public URL or one of 1,500+ supported sources (YouTube, Vimeo, Twitter, TikTok, Instagram, podcast RSS, S3 presigned, etc.):
POST /jobs
Authorization: Bearer <jwt>
Content-Type: application/json
{
  "source_url": "https://www.youtube.com/watch?v=...",
  "diarize": true,
  "language": "auto",
  "webhook_url": "https://yourapp.com/hooks/transcription"
}
Both return 202 Accepted with a job ID:
{"job_id": "job_01HZ...", "status": "queued", "created_at": "..."}
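For the URL-ingestion variant, the request can be assembled with the standard library alone. This is a sketch, not a client library: `BASE_URL` is a placeholder (the text does not give the API host), and the helper names are made up for illustration. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: the API base URL is not given in the text; this one is a placeholder.
BASE_URL = "https://api.example.com"


def build_url_job_request(jwt: str, source_url: str, webhook_url: str = "") -> urllib.request.Request:
    """Build (but do not send) the JSON variant of POST /jobs."""
    body = {"source_url": source_url, "diarize": True, "language": "auto"}
    if webhook_url:
        body["webhook_url"] = webhook_url  # leave it out entirely if you plan to poll
    return urllib.request.Request(
        url=f"{BASE_URL}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {jwt}", "Content-Type": "application/json"},
        method="POST",
    )


def parse_job_accepted(status: int, raw: bytes) -> str:
    """Return the job_id from a 202 response, or raise on anything else."""
    if status != 202:
        raise RuntimeError(f"job not accepted: HTTP {status}")
    return json.loads(raw)["job_id"]
```

Passing the result to `urllib.request.urlopen` would send it. Store the returned job ID before doing anything else, since it is what you need later for polling and for fetching the transcript.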
A few things worth knowing before you ship this. File size caps follow your plan — 100 MB on Free, 2 GB on Pro, 5 GB on Business. Max duration is 30 minutes on Free, 10 hours on Pro and Business. Accepted formats: MP3, WAV, M4A, AAC, OGG, OPUS, FLAC, WEBM, MP4, MOV, MKV, AVI, WMV — full detail on the audio-to-text page. language=auto detects from the first 30 seconds across 99 languages; override it if you already know.
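Checking those caps on the client saves a wasted upload. The sketch below copies the limits from the paragraph above. It assumes decimal megabytes and gigabytes (the text does not say whether the caps are MB or MiB) and that you already know the file's duration. `preflight` is a hypothetical helper, not part of the API.

```python
from typing import List

MB = 1000 * 1000  # assumption: decimal megabytes; the text does not say MB vs MiB
GB = 1000 * MB

# Limits copied from the paragraph above: (max bytes, max duration in seconds).
PLAN_LIMITS = {
    "free": (100 * MB, 30 * 60),
    "pro": (2 * GB, 10 * 3600),
    "business": (5 * GB, 10 * 3600),
}

ACCEPTED_EXTENSIONS = {
    "mp3", "wav", "m4a", "aac", "ogg", "opus", "flac",
    "webm", "mp4", "mov", "mkv", "avi", "wmv",
}


def preflight(plan: str, filename: str, size_bytes: int, duration_seconds: float) -> List[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    max_bytes, max_seconds = PLAN_LIMITS[plan]
    problems = []
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > max_bytes:
        problems.append(f"file exceeds the {plan} size cap")
    if duration_seconds > max_seconds:
        problems.append(f"duration exceeds the {plan} cap")
    return problems
```

An extension check is only a hint. The server decides based on the actual container, so a file that passes here can still be rejected.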
Poll, or let the webhook find you
You have two options for getting notified when the transcript is ready. Pick one — don't do both.
Option A — webhook (recommended). If you passed webhook_url in the POST, we send a signed POST to that URL on completed, failed, or cancelled. The signature is an HMAC-SHA256 of the raw body using your webhook secret (from the dashboard), in the X-Transcription-Signature header. Verify it before trusting the payload:
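import hashlib
import hmac

# raw_body is the unparsed request body (bytes); secret is your webhook secret (str)
expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
received = request.headers.get("X-Transcription-Signature", "")
if not hmac.compare_digest(expected, received):
    abort(401)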
Webhook deliveries retry with exponential backoff for up to 24 hours. Return any 2xx to acknowledge. If your endpoint is down for a day, the job is still completed — you just have to GET /jobs/{id} to pick it up.
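Because deliveries are retried, the same event may plausibly reach you more than once, for example if your 2xx was lost in transit. The handler sketch below verifies first, parses second, and ignores repeats. It assumes the payload carries `job_id` and `status` fields (its shape is not shown in the text), and `seen` stands in for whatever durable store you would use in production.

```python
import hashlib
import hmac
import json
from typing import Set, Tuple


def handle_webhook(secret: str, raw_body: bytes, signature: str, seen: Set[str]) -> Tuple[int, str]:
    """Verify, then parse, then deduplicate. Returns (HTTP status, note)."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401, "bad signature"

    event = json.loads(raw_body)  # parse only after the signature checks out
    # Assumption: the payload carries job_id and status; the text does not show its shape.
    key = f'{event["job_id"]}:{event["status"]}'
    if key in seen:
        return 200, "duplicate delivery, already handled"
    seen.add(key)
    return 200, f'handled {event["status"]}'
```

Returning 200 for a duplicate is deliberate: it acknowledges the delivery so the retries stop, without running your side effects twice.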
Option B — polling. No webhook URL? Then poll GET /jobs/{id}. Statuses progress queued → processing → completed (or failed). Poll every 10–15 seconds; the Retry-After header on processing responses tells you the suggested interval. For a 60-minute file, expect 9–11 minutes total — roughly 6× faster than realtime via parallel chunking.
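A polling loop along these lines would honour Retry-After and fall back to a fixed interval. It is a sketch under assumptions: `get_job` is a hypothetical callable wrapping GET /jobs/{id}, Retry-After is assumed to be a number of seconds rather than an HTTP date, and `cancelled` is treated as terminal because it appears among the webhook events.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical transport: get_job(job_id) -> (parsed JSON body, Retry-After header or None).
GetJob = Callable[[str], Tuple[Dict[str, str], Optional[str]]]

TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_job(
    job_id: str,
    get_job: GetJob,
    sleep: Callable[[float], None] = time.sleep,
    default_interval: float = 10.0,
    max_polls: int = 400,
) -> Dict[str, str]:
    """Poll until the job reaches a terminal status, honouring Retry-After."""
    for _ in range(max_polls):
        body, retry_after = get_job(job_id)
        if body["status"] in TERMINAL:
            return body
        try:
            # Assumption: Retry-After is sent as a number of seconds, not an HTTP date.
            interval = float(retry_after) if retry_after is not None else default_interval
        except ValueError:
            interval = default_interval
        sleep(max(interval, 1.0))  # never spin faster than once per second
    raise TimeoutError(f"{job_id} still not finished after {max_polls} polls")
```

The `max_polls` cap is there so a stuck job cannot hang a script forever. At the default interval, 400 polls is a little over an hour.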
| Pattern | When it fits |
|---|---|
| Webhook | Server-side integrations, batch pipelines, anything with a public ingress |
| Polling | Local scripts, CLI tools, anything behind a firewall |
GET the transcript in the shape you want
When the job is completed, fetch the transcript in any of five formats:
GET /jobs/{id}/transcript?format=json
- format=json: full structured response with word timings, speaker labels, confidence.
- format=txt: plain text, one paragraph per speaker turn.
- format=srt and format=vtt: subtitle files for video workflows. We deliver these as files; we don't burn them into the video.
- format=docx: speaker-labelled Word document, the same one the dashboard exports.
The JSON shape, abbreviated:
{
  "job_id": "job_01HZ...",
  "language": "en",
  "duration_seconds": 3612,
  "speakers": [{"id": "speaker_0"}, {"id": "speaker_1"}],
  "segments": [
    {"speaker": "speaker_0", "start": 0.42, "end": 4.18,
     "text": "So tell me about the experiment.", "confidence": 0.97}
  ]
}
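Once the JSON is parsed, the segments list is enough for simple analytics. The sketch below uses only the fields visible in the abbreviated shape above. The helper names are made up, and the 0.8 review threshold is an arbitrary example value, not a recommendation from the API.

```python
from collections import defaultdict
from typing import Dict, List


def talk_time_by_speaker(transcript: dict) -> Dict[str, float]:
    """Sum (end - start) per speaker label, in seconds."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in transcript["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


def low_confidence_segments(transcript: dict, threshold: float = 0.8) -> List[dict]:
    """Segments worth a human look. The 0.8 threshold is an arbitrary example."""
    return [s for s in transcript["segments"] if s["confidence"] < threshold]
```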
Speaker labels in the API are anonymous: speaker_0, speaker_1, speaker_2, and so on. There are no auto-generated names. If your users want human-readable names, your UI handles the renaming — or send them to the dashboard, where a click on the speaker chip opens a popover with rename, filter, copy, and jump-to-first-turn. For interview workflows where you already know who's who from your own data, map speaker_0 → "Reporter" in your own code.
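Renaming in your own code can be as small as a dictionary lookup. This sketch maps anonymous labels to display names and merges consecutive segments from the same speaker into one line. `render_with_names` is a hypothetical helper, and any label missing from the map keeps its anonymous form.

```python
from typing import Dict, List


def render_with_names(segments: List[dict], names: Dict[str, str]) -> str:
    """Render 'Name: text' lines, merging consecutive segments from one speaker."""
    lines: List[str] = []
    previous = None
    for segment in segments:
        label = segment["speaker"]
        display = names.get(label, label)  # fall back to the anonymous label
        if label == previous:
            lines[-1] += " " + segment["text"]
        else:
            lines.append(f"{display}: {segment['text']}")
        previous = label
    return "\n".join(lines)
```

Keeping the map outside the transcript means the stored JSON stays untouched, so a later rename does not require re-fetching anything.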
The diarization method is chosen automatically. Stereo files split by channel (L=speaker_0, R=speaker_1) with 100% confidence, so no model is needed. Mono files run through pyannote/speaker-diarization-3.1 to identify turns.
The failure modes that actually happen
These are the responses your code needs to handle. Skipping them is the most common reason an integration looks fine on day one and falls over in week two.
402 quota_exceeded: your monthly minutes are spent. If your account has overage enabled, the response includes an overage_allowed: true flag; submit again with ?accept_overage=true to bill at $0.04/min on Pro or $0.02/min on Business. Without that, the job is rejected. Surface this clearly to your user; don't retry silently. Quotas live on the pricing page.
415 invalid_audio — the file decoded to silence, was zero bytes, exceeded the duration cap, or wasn't a recognised media container. The body includes a reason field (unsupported_format, decode_failed, duration_exceeds_limit, empty_audio). Do not retry — the file is the problem.
5xx transient_error followed by automatic model fallback. Primary recognition runs on AssemblyAI Universal-3. On transient errors only — network, 5xx from the upstream, SDK exception — the job automatically falls back to Whisper Large-v3 and continues. You don't do anything. The final transcript response includes "engine": "whisper_fallback" so you know which model produced it. Quota errors from the primary do NOT trigger fallback; those surface to you as 402.
429 rate_limited — too many job submissions per second. Respect the Retry-After header. Concurrent job limits are 20 on Pro and 50 on Business; submitting a 21st job on Pro queues it, it doesn't fail.
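Webhook signature mismatch. This is almost always a body-parsing issue: frameworks like Express or FastAPI typically hand your handler the parsed JSON, and re-serializing it rarely reproduces the original bytes (whitespace and key order can change). Compute the HMAC against the raw bytes exactly as received, not the parsed JSON.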
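The handling rules in this section can be collected into one decision function so they live in a single place. This is a sketch, not an official client: the `Action` names are invented, and treating a 5xx on your own HTTP call as retryable is an assumption (a retried POST /jobs could create a duplicate job, so you may prefer to retry only GETs).

```python
from enum import Enum


class Action(Enum):
    REFRESH_TOKEN = "refresh the access token, then retry once"
    ASK_USER_ABOUT_OVERAGE = "ask the user, then resubmit with accept_overage=true"
    SURFACE_TO_USER = "show the error; do not retry"
    WAIT_AND_RETRY = "sleep for Retry-After, then retry"
    RETRY_WITH_BACKOFF = "retry with backoff"
    PROCEED = "no error"


def decide(status: int, body: dict) -> Action:
    """Map an HTTP response to the handling described in this section."""
    if status == 401:
        return Action.REFRESH_TOKEN
    if status == 402:
        if body.get("overage_allowed") is True:
            return Action.ASK_USER_ABOUT_OVERAGE
        return Action.SURFACE_TO_USER
    if status == 415:
        return Action.SURFACE_TO_USER  # the file is the problem; body["reason"] says why
    if status == 429:
        return Action.WAIT_AND_RETRY
    if 500 <= status < 600:
        # Assumption: a 5xx on your own HTTP call is safe to retry. This is separate
        # from upstream-model failures, which the service absorbs via fallback.
        return Action.RETRY_WITH_BACKOFF
    return Action.PROCEED
```

Note that 402 never maps to a silent retry: the overage path always goes through the user first.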
Privacy: what we keep and for how long
Source audio is permanently deleted from our infrastructure within 24 hours of job completion. Transcripts stay in your account until you delete them via the API or dashboard. We do not train models on your data. If your users ask, that's the answer — own it as a feature.
FAQ
How long does a transcription API job typically take?
Approximately 6× faster than realtime: a 60-minute file completes in 9–11 minutes. The job runs in parallel chunks, so a 4-hour file finishes in roughly 40 minutes rather than 240. Queue time on Free and Pro plans is usually under 5 seconds; Business gets priority processing. If you need a hard SLA, polling gives you a real-time progress signal via the Retry-After header on processing responses.
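For a rough planning number, the quoted 9–11 minutes per hour of audio can be scaled linearly. Linear scaling is an assumption: the text gives one data point plus the 4-hour example, and queue time is not included.

```python
from typing import Tuple


def estimated_processing_minutes(audio_minutes: float) -> Tuple[float, float]:
    """Scale the quoted 9-11 minutes per 60 minutes of audio linearly."""
    low_per_hour, high_per_hour = 9.0, 11.0
    hours = audio_minutes / 60.0
    return (low_per_hour * hours, high_per_hour * hours)
```

For a 4-hour (240-minute) file this gives a range of 36 to 44 minutes, which brackets the 240 / 6 = 40 minutes implied by the 6× figure.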
Can I submit a YouTube URL directly to the transcription API?
Yes. POST to /jobs with a JSON body containing source_url set to the YouTube link (or TikTok, Vimeo, Twitter, Instagram, or any of 1,500+ sources we ingest via yt-dlp). The API downloads the media server-side, extracts audio, and runs the same pipeline as direct uploads. Useful for YouTube transcription workflows where you don't want to pull the file locally first.
What happens if my webhook endpoint is down when the job completes?
The job still completes successfully. We retry the webhook with exponential backoff for up to 24 hours. If every retry fails, the transcript stays in your account — fetch it with GET /jobs/{id}/transcript?format=json once your endpoint is back up. You can also list recent jobs with GET /jobs?status=completed&since=... to catch anything you missed during the outage.
How do I know which speech recognition model produced the transcript?
The transcript response includes an engine field. The value is assemblyai_universal_3 for primary recognition or whisper_large_v3 when fallback kicked in due to a transient error on the primary. Fallback only happens on network or 5xx errors, never on quota exhaustion — those surface to you as 402. The single-vendor mode (Whisper only) is also possible for self-hosted deployments and reports whisper_large_v3 as primary.
Is the transcription API rate-limited?
Yes, by concurrent jobs rather than per-second QPS. Pro allows 20 concurrent jobs; Business allows 50. Submitting beyond that returns 429 with a Retry-After header, or — if you have queueing enabled on your account — accepts the job and runs it when a slot frees up. There is no per-minute submission cap below those concurrency limits, so burst submissions are fine within the concurrency budget.
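If you would rather not depend on server-side queueing or on handling 429s, one option is to cap in-flight jobs on the client. The sketch below is a hypothetical helper that tracks job keys against a limit you choose, for example 20 on Pro. It does not talk to the API.

```python
import threading
from typing import Set


class JobSlots:
    """Tracks in-flight job keys so a client never exceeds its chosen concurrency."""

    def __init__(self, limit: int):
        self._limit = limit
        self._in_flight: Set[str] = set()
        self._lock = threading.Lock()

    def try_acquire(self, job_key: str) -> bool:
        """Reserve a slot before submitting; False means wait for a job to finish."""
        with self._lock:
            if job_key in self._in_flight:
                return True  # already counted
            if len(self._in_flight) >= self._limit:
                return False
            self._in_flight.add(job_key)
            return True

    def release(self, job_key: str) -> None:
        """Call when the job reaches completed, failed, or cancelled."""
        with self._lock:
            self._in_flight.discard(job_key)
```

Acquire before POST /jobs and release from your webhook handler or polling loop. The lock makes it safe to share one instance across worker threads.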
Does the API support real-time streaming transcription?
No. The current API is batch only — POST a file or URL, get a transcript when the job completes. Latency is 9–11 minutes for a 60-minute file, not seconds. If you need live captions during a call, this is the wrong tool. If you need a transcript shortly after a meeting ends, batch is the right tool and avoids the cost overhead of streaming infrastructure.
How are speakers labelled in the API response?
Speakers are returned as speaker_0, speaker_1, speaker_2, and so on: anonymous, integer-suffixed labels, not names. For stereo files, the labels map to channels (left=speaker_0, right=speaker_1). For mono files, pyannote/speaker-diarization-3.1 identifies turns. Rename them in your own application code if your users need human-readable labels, or send them to the dashboard where the speaker popover handles renaming, filtering, and per-speaker copy.
Related reading
- Audio to text — supported formats and accuracy — the file types and the real-world accuracy numbers behind the API
- Video to text — SRT and VTT export — what changes when the input is video instead of audio
- Interview transcription workflow — how journalists use diarization + rename in practice
- Pricing and quotas — how minutes, file size caps, and concurrency map to plans