The transcription API, end to end, on one screen

A working integration takes three kinds of HTTP call, all authenticated with a single API-key header. First, POST the audio file (or a URL) to create a job. Second, poll GET /jobs/{id} until it's done. Third, read the transcript: it arrives inline in the final poll response, or you can fetch it from the export endpoint in the format you want. No token exchange, no SDK required. Below: the exact request shapes, the real status flow, and the failure modes that actually happen.

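Every endpoint lives under the base URL https://api.transcription.solutions/api/v1. The raw request examples below show the full path (/api/v1/jobs); the prose shortens it to /jobs, /jobs/{id}, and so on.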

Auth: one header, no token dance

There's no OAuth handshake and no JWT to manage. Create an API key in the dashboard (Settings → API keys), then send it as X-API-Key on every request.

GET /api/v1/jobs?limit=5
X-API-Key: ts_...

API access is available on Pro and Business plans. The key is shown once at creation — store it somewhere safe. You can hold up to five keys at a time and revoke any of them from the dashboard.
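A minimal Python sketch of the authenticated call above, using only the standard library. The helper name build_request and the TS_API_KEY environment variable are our own illustrative choices, not part of the API. The function only builds the request; it never sends it.

```python
import os
import urllib.request

BASE_URL = "https://api.transcription.solutions/api/v1"


def build_request(path, api_key=None, method="GET"):
    """Build (but do not send) a request carrying the X-API-Key header.

    `path` is relative to BASE_URL, e.g. "/jobs?limit=5".
    TS_API_KEY is a hypothetical env var name; read the key from wherever
    you keep secrets, never from source code.
    """
    key = api_key if api_key is not None else os.environ["TS_API_KEY"]
    return urllib.request.Request(
        BASE_URL + path,
        headers={"X-API-Key": key},
        method=method,
    )
```

To send it, pass the result to urllib.request.urlopen and json.load the response. urllib normalizes the header name to X-api-key on the wire, which is fine because HTTP header names are case-insensitive.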

POST the file — or the URL

There are two create endpoints, depending on where the media lives.

Direct file upload — multipart form, for files on your machine:

POST /api/v1/jobs
X-API-Key: ts_...
Content-Type: multipart/form-data

file=@interview.mp3
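If you would rather not pull in an HTTP library, the multipart body can be built by hand. This is a minimal sketch that assumes only the file field shown above is needed. The helper names, the random boundary, and the generic application/octet-stream part type are our choices; the filename is not escaped, so keep it free of quotes.

```python
import urllib.request
import uuid

BASE_URL = "https://api.transcription.solutions/api/v1"


def encode_multipart(filename, payload, field="file", mime="application/octet-stream"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header). The boundary is random per call.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(filename, payload, api_key):
    """POST /jobs with the audio bytes in the `file` form field (not sent here)."""
    body, content_type = encode_multipart(filename, payload)
    return urllib.request.Request(
        BASE_URL + "/jobs",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": content_type},
        method="POST",
    )
```

Usage: build_upload_request("interview.mp3", open("interview.mp3", "rb").read(), key). This reads the whole file into memory, which is fine for typical recordings; for multi-hour files a streaming client is kinder to your RAM.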

URL ingestion — JSON body, for a public file or any of ~1,500 supported sources (YouTube, Vimeo, podcast RSS, a direct CDN mp3, and more). The endpoint is different and the field is url:

POST /api/v1/jobs/from-url
X-API-Key: ts_...
Content-Type: application/json

{
  "url": "https://www.youtube.com/watch?v=...",
  "language": "auto"
}

Both return the new job:

{"id": "44acf4ef-0de9-489e-bb0d-26004f1c9e80", "status": "queued"}

The job identifier is the id field, a UUID. Setting language to "auto" detects the spoken language automatically; override it with an ISO code if you already know it. File-size and duration caps depend on your plan (Free tops out at 30 minutes per file and 30 minutes per month; Pro and Business allow up to 10 hours per file). See the pricing page for the current limits and the audio-to-text page for the accepted formats.
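Here is a sketch of the URL-ingestion call and of reading the id back from either create endpoint, again stdlib-only. The function names are illustrative. Validating the id as a UUID is our own defensive choice, based on the response shape shown above.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://api.transcription.solutions/api/v1"


def build_url_job_request(media_url, api_key, language="auto"):
    """POST /jobs/from-url with a JSON body; `url` is the field the endpoint expects."""
    body = json.dumps({"url": media_url, "language": language}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/jobs/from-url",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def job_id_from_create_response(raw):
    """Both create endpoints answer with {"id": "<uuid>", "status": "queued"}.

    Raises ValueError if the id is not a well-formed UUID.
    """
    job = json.loads(raw)
    return str(uuid.UUID(job["id"]))
```

Store the returned id right away; it is the only handle you have for polling, exporting, and deleting the job.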

Poll until it's done

Realtime push (completion webhooks, WebSocket) isn't available on the API-key path yet — poll GET /jobs/{id}:

GET /api/v1/jobs/44acf4ef-0de9-489e-bb0d-26004f1c9e80
X-API-Key: ts_...

Status moves through a real pipeline, not a single processing blob:

queued → downloading → extracting → transcribing → diarizing → analyzing → done

downloading appears only for URL jobs; uploads skip it. A job that can't complete ends in status failed, with error_message and failure_code set. Poll every few seconds: short clips finish in seconds, and longer files take time roughly in proportion to their duration.
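The polling rule above fits in a small loop that treats done and failed as the only terminal states and every other status as "keep waiting". In this sketch, fetch_job is a hypothetical callable you supply (it performs GET /jobs/{id} and returns the parsed JSON). It is injected so the loop can be tested offline. The 5-second interval and one-hour timeout are our defaults, not API values.

```python
import time


class JobFailed(Exception):
    """Raised when a job ends in status "failed"; keeps the job body for inspection."""

    def __init__(self, job):
        self.job = job
        super().__init__(f"{job.get('failure_code')}: {job.get('error_message')}")


def wait_for_job(fetch_job, job_id, interval=5.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until status is "done" (return the job) or "failed" (raise).

    Intermediate stages (queued, downloading, extracting, transcribing,
    diarizing, analyzing) are not listed explicitly: anything that is not
    terminal simply means "poll again".
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        status = job["status"]
        if status == "done":
            return job
        if status == "failed":
            raise JobFailed(job)
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

Not enumerating the intermediate stages keeps the loop working if a stage is skipped (uploads never report downloading) or a new one appears later.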

When status is done, the transcript is already in the same response — no extra call needed:

{
  "id": "44acf4ef-...",
  "status": "done",
  "progress": 100,
  "duration_sec": 6.04,
  "language": "auto",
  "transcription": {
    "id": "…",
    "full_text": "…",
    "language_detected": "en"
  },
  "translations": []
}
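Note that the top-level language field echoes what you sent ("auto" here); the detected language is transcription.language_detected. A small sketch for pulling both out of a finished response (the function name is illustrative):

```python
def read_transcript(job):
    """Return (full_text, detected_language) from a finished GET /jobs/{id} response."""
    if job.get("status") != "done":
        raise ValueError(f"job is {job.get('status')!r}, not 'done'")
    transcription = job["transcription"]
    # Use language_detected, not job["language"], which is just the request echo.
    return transcription["full_text"], transcription.get("language_detected")
```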

Export in the shape you want

For anything richer than plain text, hit the export endpoint:

GET /api/v1/jobs/{id}/export?format=json
X-API-Key: ts_...

Seven formats are supported: txt, srt, vtt, docx, md, pdf, and json. The subtitle formats (srt, vtt) are delivered as files for your video workflows — we don't burn them into the video.
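A sketch of building the export URL, rejecting unknown formats before a request is made. The local format check is our choice; the seven values come from the list above. Send it with the same X-API-Key header as every other call.

```python
import urllib.parse

BASE_URL = "https://api.transcription.solutions/api/v1"
EXPORT_FORMATS = {"txt", "srt", "vtt", "docx", "md", "pdf", "json"}


def export_url(job_id, fmt):
    """URL for GET /jobs/{id}/export?format=...; raises ValueError for unknown formats."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported export format: {fmt!r}")
    path = f"/jobs/{urllib.parse.quote(job_id, safe='')}/export"
    return BASE_URL + path + "?" + urllib.parse.urlencode({"format": fmt})
```

docx and pdf are binary formats, so write those response bodies to disk in "wb" mode rather than decoding them as text.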

The json export is a stable, versioned shape:

{
  "schema_version": "1",
  "file": "interview.mp3",
  "language": "en",
  "generated_at": "…",
  "duration_seconds": 3612,
  "text": "…",
  "word_timestamps": [ … ],
  "diarization": { … },
  "summary": "…",
  "key_points": [ … ],
  "action_items": [ … ],
  "topics": [ … ],
  "speaker_contribution": { … }
}
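Because the shape is versioned, a consumer can refuse to parse a version it wasn't written for instead of silently misreading fields. A minimal sketch follows. Whether the AI fields (summary, key_points, and so on) are populated before the summarize add-on has run isn't stated here, so the helper reads them with .get as a precaution. The helper names are illustrative.

```python
import json

SUPPORTED_SCHEMA = "1"


def load_export(raw):
    """Parse a json-format export; raise ValueError on an unexpected schema_version."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unexpected schema_version {version!r}; expected {SUPPORTED_SCHEMA!r}")
    return doc


def headline(doc):
    """Summary and key points if present; empty defaults otherwise (assumption, see above)."""
    return doc.get("summary") or "", doc.get("key_points") or []
```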

diarization carries the speaker-attributed segments. Speaker labels are anonymous; the API does not generate human names. If your users want real names, for example in interview workflows where you already know who's who, map the labels to names in your own application code. Alternatively, send users to the dashboard, where clicking a speaker chip opens a popover with rename, filter, copy, and jump-to-first-turn.
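The segment layout inside diarization is elided above, so this renaming sketch rests on an assumption you should check against a real export: each segment is a dict whose anonymous label sits under a key we call "speaker". The key name is a hypothetical parameter, and so is where the segment list lives inside the diarization object, which is why the function takes the list directly.

```python
def rename_speakers(segments, names, label_key="speaker"):
    """Return copies of diarization segments with anonymous labels mapped to names.

    ASSUMPTION: the label lives under `label_key` in each segment dict.
    Labels missing from `names` are left unchanged; input is not mutated.
    """
    renamed = []
    for segment in segments:
        copy = dict(segment)
        label = copy.get(label_key)
        if label in names:
            copy[label_key] = names[label]
        renamed.append(copy)
    return renamed
```

Keeping the mapping in your own code means the anonymous labels from the API stay the source of truth, and a wrong name is a one-line fix on your side.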

AI add-ons (Pro and up)

Each add-on is a single POST against a finished job:

POST /api/v1/jobs/{id}/summarize     # executive summary, key points, action items
POST /api/v1/jobs/{id}/polish        # paragraph breaks + punctuation cleanup

POST /api/v1/transcriptions/{transcription_id}/translate
X-API-Key: ts_...
Content-Type: application/json

{"target_lang": "es"}

The transcription_id for translate is the transcription.id from the job response above.
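The easy mistake here is passing the job id to translate. This sketch builds all three add-on requests from a finished job response, so the right id is picked automatically. The function name and the addon strings are illustrative. The source shows no request body for summarize or polish, so none is sent.

```python
import json
import urllib.request

BASE_URL = "https://api.transcription.solutions/api/v1"


def addon_request(job, addon, api_key, target_lang=None):
    """Build (not send) the POST for "summarize", "polish", or "translate".

    summarize/polish are keyed on the job id; translate is keyed on
    job["transcription"]["id"] and carries {"target_lang": ...} as JSON.
    """
    headers = {"X-API-Key": api_key}
    if addon in ("summarize", "polish"):
        url = f"{BASE_URL}/jobs/{job['id']}/{addon}"
        return urllib.request.Request(url, headers=headers, method="POST")
    if addon == "translate":
        if not target_lang:
            raise ValueError("translate needs target_lang, e.g. 'es'")
        headers["Content-Type"] = "application/json"
        body = json.dumps({"target_lang": target_lang}).encode("utf-8")
        url = f"{BASE_URL}/transcriptions/{job['transcription']['id']}/translate"
        return urllib.request.Request(url, data=body, headers=headers, method="POST")
    raise ValueError(f"unknown add-on: {addon!r}")
```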

The failure modes that actually happen

Handle these and your integration survives week two:

401 / 403 — missing or invalid X-API-Key, or a key on a plan without API access. Errors on auth come back as {"detail": "..."}.
404 Job not found — wrong id, or a job that belongs to a different account. Keys are scoped to the account that created them.
failed status — the job ran but couldn't finish (unsupported or silent audio, a URL that wasn't media, a download that was blocked upstream). Read error_message and failure_code from the job rather than retrying blindly — the input is usually the problem.
402 when you're out of minutes — your monthly quota is spent. Surface it to the user; don't retry silently. Quotas live on the pricing page.
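429 rate_limited means you've exceeded the limit of 60 requests per minute per key. Back off before retrying. When several jobs are in flight, space out per-job polling or check them with a single GET /jobs?limit=... call rather than hammering each job individually.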
Application errors are JSON of the form {"detail": {"code": "...", "message": "..."}} — branch on code, show message.
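The two error body shapes above (a plain string for auth, a code/message object for application errors) can be normalized in one place. In this sketch, the action names are our own labels for the advice in the list, not API values. A failed job is not covered here: it arrives in the job body, not as an HTTP error status, so handle it wherever you poll.

```python
import json


def describe_error(status, body):
    """Map an HTTP error to (action, code, message).

    Handles {"detail": "..."} (auth) and
    {"detail": {"code": "...", "message": "..."}} (application errors).
    """
    try:
        detail = json.loads(body).get("detail")
    except (TypeError, ValueError, AttributeError):
        detail = None  # body missing, not JSON, or not a JSON object
    if isinstance(detail, dict):
        code, message = detail.get("code"), detail.get("message")
    else:
        code, message = None, detail

    if status in (401, 403):
        action = "check_key"      # missing/invalid key, or plan without API access
    elif status == 402:
        action = "tell_user"      # monthly minutes spent: surface it, don't retry
    elif status == 404:
        action = "check_job_id"   # wrong id, or a job owned by another account
    elif status == 429:
        action = "back_off"       # over 60 requests per minute on this key
    else:
        action = "report"         # anything else: log code and message
    return action, code, message
```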

Privacy: what we keep and for how long

Source audio is permanently deleted from our infrastructure within about 24 hours of job completion — you can see the deletion timestamp on the job itself. Transcripts stay in your account until you delete them via the API or the dashboard, and DELETE /jobs/{id} cleanly removes a job and its file while preserving your billing audit trail. We do not train models on your data. If your users ask, that's the answer — own it as a feature.

FAQ

How long does a transcription API job typically take?

It scales with the length of the audio; there is no fixed SLA. Short clips finish in seconds, and a long recording takes proportionally longer as it moves through download, audio extraction, transcription, and speaker analysis. Poll GET /jobs/{id} for the live status and progress values rather than guessing; the job reports exactly which stage it's in.

Can I submit a YouTube URL directly to the transcription API?

Yes. POST /jobs/from-url with a JSON body containing url set to the link (or any of ~1,500 sources we ingest — TikTok, Vimeo, podcast RSS, direct CDN files, and more). The API downloads the media server-side, extracts the audio, and runs the same pipeline as a direct upload. Handy for YouTube transcription workflows where you don't want to pull the file locally first.

Does the API send a webhook when a job finishes?

Not on the API-key path yet — completion webhooks and the WebSocket stream currently require a dashboard session. For server-to-server integrations, poll GET /jobs/{id} until status is done (or failed). You can also list recent jobs with GET /jobs?limit=... to reconcile anything in flight.

Is the transcription API rate-limited?

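Yes: 60 requests per minute per key. That's plenty for create-then-poll integrations. Polling a single job every few seconds (not every 200 ms) uses only a fraction of the budget. With many jobs in flight, though, the per-job polls add up, so stretch the interval or check them together with GET /jobs?limit=... to stay under the limit.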
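The arithmetic behind that advice: 60 requests per minute is one request per second. Polling N jobs individually, each every t seconds, costs 60 * N / t requests per minute, so t must be at least N seconds just to break even. This sketch adds a safety margin; the 80% headroom is our own choice, not an API rule.

```python
import math

RATE_LIMIT_PER_MIN = 60


def min_poll_interval(jobs_in_flight, other_calls_per_min=0, headroom=0.8):
    """Smallest whole-second per-job polling interval that keeps one key under the limit.

    headroom budgets only that fraction of the limit (0.8 is our margin).
    other_calls_per_min covers creates, exports, and add-ons on the same key.
    """
    budget = RATE_LIMIT_PER_MIN * headroom - other_calls_per_min
    if budget <= 0:
        raise ValueError("other traffic already uses the whole budget")
    # N jobs polled every t seconds cost 60 * N / t requests/min; solve for t.
    return math.ceil(60 * jobs_in_flight / budget)
```

With the default margin, one job can be polled every 2 seconds, while ten jobs in flight need about 13 seconds between polls of each job.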

Does the API support real-time streaming transcription?

No. The API is batch only — POST a file or URL, then poll for the transcript when the job completes. If you need live captions during a call, this is the wrong tool. If you need an accurate transcript shortly after a meeting or recording, batch is the right fit and avoids the cost of streaming infrastructure.

How are speakers labelled in the API response?

The diarization object in the JSON export returns speaker-attributed segments with anonymous labels — not names. Rename them in your own application code if your users need human-readable labels, or send them to the dashboard, where the speaker popover handles renaming, filtering, and per-speaker copy.
