

Transcription languages, one price: why we don't tier

We charge the same per minute for English, Japanese, or Swahili. Here's how AssemblyAI Universal-3 makes 99-language flat pricing possible — and where the catch actually sits.

99 languages, one price — why we don't tier by language

Transcription.Solutions charges the same per-minute rate for all 99 supported languages. A minute of Japanese costs the same as a minute of English, which costs the same as a minute of Swahili. This works because one model, AssemblyAI Universal-3, handles every language in a single inference path, so there's no per-language compute cost to pass along. Accuracy does vary by language, but that's a quality consideration, not a pricing one.

What "one price across 99 languages" actually means

On the $19/month Pro plan you get 600 minutes of transcription. Those minutes are interchangeable: 600 minutes of English, or 600 minutes of Portuguese, or any mix. Overage is $0.04/minute on Pro and $0.02/minute on Business regardless of language. The same rule applies to the free tier (30 minutes/month) and the Business plan ($49/month for 2,500 minutes).

There is no "supported language" list with prices next to each entry. There is no Tier-1/Tier-2/Tier-3 structure. There is one rate card, and 99 languages share it.
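The quota arithmetic in this section can be sketched in a few lines of Python. This is an illustration, not our billing code: the plan figures come from the paragraph above, while the function name, the dictionary layout, the whole-minute rounding, and the assumption that the free tier has no paid overage are all hypothetical.

```python
# Plan figures are taken from the pricing paragraph; amounts are in US cents
# so the arithmetic stays exact.
PLANS = {
    "free": {"base_cents": 0, "included_min": 30, "overage_cents": None},
    "pro": {"base_cents": 1900, "included_min": 600, "overage_cents": 4},
    "business": {"base_cents": 4900, "included_min": 2500, "overage_cents": 2},
}


def monthly_cost_cents(plan: str, minutes_by_language: dict[str, int]) -> int:
    """Return the monthly bill in cents for whole minutes used per language."""
    p = PLANS[plan]
    # Language never enters the calculation: only the total matters.
    total = sum(minutes_by_language.values())
    extra = max(0, total - p["included_min"])
    if extra and p["overage_cents"] is None:
        # Assumption: the free tier has no paid overage.
        raise ValueError("free tier quota exceeded")
    return p["base_cents"] + extra * (p["overage_cents"] or 0)


usage = {"en": 300, "ja": 250, "sw": 150}  # 700 minutes in total
print(monthly_cost_cents("pro", usage))  # 1900 + 100 * 4 = 2300
```

Swapping the language keys around, or putting all 700 minutes under one language, gives the same 2300 cents ($23.00), which is the whole point of a single rate card.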

How one model handles 99 languages

AssemblyAI Universal-3 is a single multilingual speech recognition model trained jointly on audio across all the languages it supports. When you upload a file, the model runs one inference pass that produces both the language identification and the transcript — there isn't a separate "language detection model" and "transcription model" billed independently.

Universal-3 is our primary backend. If it returns a transient error (network, 5xx), we fall back to OpenAI's Whisper Large-v3, which is also a single multilingual model. Either way, the architecture is the same shape: one model, one inference, all languages.

This matters for pricing because GPU time is roughly constant per minute of audio regardless of the language being recognised. A minute of Korean speech doesn't take more compute than a minute of Spanish. We pay per second of audio processed, not per language supported, so we charge the same way.
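The primary-then-fallback behaviour described in this section might look roughly like the sketch below. It is a shape, not our production code: `TransientBackendError`, `transcribe_with_fallback`, and the two stand-in backends are hypothetical names, and neither AssemblyAI's nor OpenAI's real client API is shown.

```python
from typing import Callable


class TransientBackendError(Exception):
    """Hypothetical error type for network failures and 5xx responses."""


def transcribe_with_fallback(
    audio: bytes,
    primary: Callable[[bytes], str],
    fallback: Callable[[bytes], str],
) -> tuple[str, str]:
    """Return (backend_name, transcript), using fallback only on transient errors."""
    try:
        return "primary", primary(audio)
    except TransientBackendError:
        # Non-transient errors (bad file, unsupported format) are not caught
        # here and propagate to the caller.
        return "fallback", fallback(audio)


def flaky_primary(audio: bytes) -> str:
    # Stand-in for a primary backend that is currently returning a 503.
    raise TransientBackendError("503 from primary backend")


def steady_fallback(audio: bytes) -> str:
    # Stand-in for the fallback backend.
    return f"transcript of {len(audio)} bytes"


print(transcribe_with_fallback(b"\x00" * 4, flaky_primary, steady_fallback))
# ('fallback', 'transcript of 4 bytes')
```

Note that no language argument appears anywhere in the call: both backends are multilingual, so the fallback path needs no per-language routing.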

Auto-detection from the first 30 seconds

You don't pick the language. The model listens to roughly the first 30 seconds of your file and identifies the spoken language automatically. That decision then guides recognition for the rest of the file.

This works well for clean single-language audio. It can stumble in two cases:

  • Mixed-language audio. A bilingual interview that switches between French and English mid-sentence, or a podcast where the host introduces a guest in one language and switches to another for the conversation.
  • Heavy code-switching. Some Indian-English, Mexican-Spanish, or Singaporean-English speakers blend languages within a sentence.

For both, you can override the auto-detected language manually in the audio-to-text settings. Manual override forces the model to treat the entire file as the language you specified, which usually beats letting it guess on ambiguous input.

Accuracy does vary by language — but it's not a pricing question

This is where competitors sometimes justify tiered pricing: "high-resource languages cost more because they're better." We disagree with the framing.

Universal-3 hits the lowest word error rates on languages with the largest training corpora. As a rough internal pattern:

Language group | Typical WER on clean podcast audio
High-resource (English, Spanish, French, German, Mandarin, Portuguese, Japanese) | 3–8%
Mid-resource (Polish, Vietnamese, Turkish, Thai, Indonesian) | 8–15%
Low-resource (Swahili, Tamil, Welsh, Galician) | 15–40%

These numbers shift with audio quality. A clean studio recording in Tamil can outperform a noisy phone call in English. The pattern is real but not destiny.
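For readers who want to measure this on their own files, word error rate is the word-level edit distance between a trusted reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal version under simplifying assumptions: it splits on whitespace and does no text normalisation (casing, punctuation, numerals), which real evaluations usually apply first. The function name is our own choice for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Whitespace splitting does not suit languages written without spaces between words, such as Japanese or Mandarin; those are commonly scored at the character level instead.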

The honest argument: if your Swahili transcript needs a heavier human review pass, that's your time, not our compute. Charging you double per minute for Swahili wouldn't make the model better — it would just make us money for the same inference. So we don't.

How this compares to tiered pricing models

A common pattern in transcription tooling is to bundle "premium languages" into higher-tier plans, or to charge a per-minute premium for languages outside a default set.

The pure "premium-language surcharge" model is mostly a holdover from human-transcription services. Among current AI-first tools, the more common constraint is which languages the tool supports at all — Otter.ai transcribes 3 (English, Spanish, French); Trint and Sonix list 50+; Rev charges the same per-minute rate across its supported languages. The pricing-tier-by-language pattern is rare in modern AI transcription. What hides behind a "premium" tier is usually access to higher concurrency limits, longer file lengths, or human-review add-ons — not better language coverage.

What we can say without speculation: when you compare a 60-minute Japanese interview to a 60-minute English interview on Transcription.Solutions, both consume 60 minutes from your quota. Both export to the same seven formats — TXT, SRT, VTT, DOCX, MD, JSON, PDF. Both get speaker diarization on Pro and Business. There is no language-conditional feature gating.

The catch you should know about

Two real caveats:

Low-resource language quality has a floor. If you transcribe a podcast in Yoruba or Pashto, expect to spend more time editing than you would on an English file. The model gets you a usable draft; it doesn't always get you a final transcript. This is true of every multilingual ASR system currently shipping, not just ours.

Diarization is language-agnostic but accent-sensitive. Our speaker diarization uses pyannote 3.1 for mono files (and a direct channel split for stereo). It separates speakers based on voice characteristics, not language, so it works the same across all 99 languages. But heavy accents or very similar-sounding speakers can confuse it regardless of language.
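The "direct channel split" for stereo files is simple enough to sketch. The example below assumes interleaved 16-bit signed PCM samples (left, right, left, right, ...) with one speaker recorded per channel; both are assumptions for illustration, and `split_stereo` is a hypothetical helper rather than our pipeline code.

```python
from array import array


def split_stereo(interleaved: array) -> tuple[array, array]:
    """Split interleaved L/R samples into two mono channels."""
    if len(interleaved) % 2:
        raise ValueError("stereo data must contain an even number of samples")
    left = interleaved[0::2]   # even positions: left channel
    right = interleaved[1::2]  # odd positions: right channel
    return left, right


# Typecode "h" is a signed 16-bit integer, matching 16-bit PCM.
frames = array("h", [10, -10, 20, -20, 30, -30])
left, right = split_stereo(frames)
print(left.tolist(), right.tolist())  # [10, 20, 30] [-10, -20, -30]
```

Nothing in the split looks at what is being said, which is why this path behaves the same in every language.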

What this means for your case

If your work spans multiple languages (a journalist with sources in three countries, a researcher with field recordings from different regions, a podcaster who books international guests), flat pricing across languages changes the calculation. You're not picking the cheapest language to record in. You're picking the best source.

The free tier gives you 30 minutes a month with no card. That's enough to run two or three of your real files in the languages you actually work with, check the output, and decide whether the quality holds up for your specific accents and audio conditions. Quality varies by language — pricing doesn't.

FAQ

Which 99 languages are supported?

The list includes all major European languages (English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian), East Asian (Mandarin, Cantonese, Japanese, Korean), South Asian (Hindi, Tamil, Bengali, Urdu, Marathi, Telugu), Middle Eastern (Arabic, Hebrew, Persian, Turkish), Southeast Asian (Vietnamese, Thai, Indonesian, Tagalog), African (Swahili, Amharic, Afrikaans), and many smaller languages. Auto-detection covers all of them.

Can I transcribe a file with two languages in it?

Yes, but with caveats. The model identifies the dominant language in the first 30 seconds and uses that for the whole file. For files that switch languages, manually override the language setting to whichever one is dominant, then run a second pass on the other-language sections separately. Continuous code-switching mid-sentence is the hardest case — expect to clean up the output.

Does language affect transcription speed?

No. All 99 languages run through the same inference path at roughly 6× realtime speed, so a 60-minute file completes in 9–11 minutes whether it's in English or Vietnamese. Speed is a function of audio duration and our parallel chunking, not language complexity.

Is accuracy really worse for low-resource languages?

Yes, and we'd rather be honest about it. Languages with smaller training corpora (Swahili, Welsh, Tamil, Galician, and similar) typically land at 15–40% word error rate on clean podcast audio, versus 3–8% for high-resource languages, and those numbers shift with audio quality. Expect a heavier review pass. The free tier is the cheapest way to test whether the quality is good enough for your specific language and audio before committing.

Do I pay extra for automatic language detection?

No. Detection is part of the same inference that produces the transcript — it's not a separate billable step. The first 30 seconds of audio are used to identify the language, and that decision propagates through the rest of the file. Manual language override is also free.

What if my language isn't on the list?

Then we can't transcribe it well, and we'd rather you find that out before paying. Upload a 5-minute sample on the free tier — if the output is unusable, the language likely falls outside the 99 we support. We won't quietly produce nonsense and charge you for it; the file will either transcribe in the closest-matching supported language or return a low-confidence result you can see immediately.
