Pulling a clean script from finished voiceover audio

If the original script is gone — or the client wants a Spanish version that matches the English line readings — you reverse-engineer it from the mixed audio. Run the finished file through a modern ASR engine, capture word-level timecodes, then clean the transcript of slate, retakes, and director chatter. The result is a working script: words, timing, and enough performance notes that a translator or a re-record artist can match the original.

This is not a transcription job in the podcast sense. A podcast transcript is for readers. A voiceover script is for performers and editors — it has to conform to picture, to the breath, and to the take. And finished audio is hostile to ASR: a music bed and SFX will push a stock model into hallucination unless you pre-process.

The three scenarios where this comes up

Most reverse-engineering requests fall into one of three buckets. Each one changes what the transcript needs to capture.

Script lost or never delivered. The agency shipped a 2019 radio spot, the writer left, the project folder is corrupted, and legal needs the copy for re-clearance. You can't paraphrase a pharma disclaimer or a financial-services mouse-print — the exact words are the deliverable. Words matter more than frame-accurate sync here.

Translation prep for dubbing or re-voicing. A 60-second English spot ships to seven markets. The translator needs the source script with line-by-line timecodes, because the localised line has to fit the same window — if the English read is 2.4 seconds, the Spanish has to be writable and performable inside 2.4 seconds. Syllable budget is the constraint. Timecodes are as important as the words.

Version-2 conform. The client kept the original read but wants three lines changed. The new VO artist needs to match cadence, emphasis, and breath placement on the unchanged lines so the new takes drop into the old edit. Here, the transcript is half script, half performance notation.

The takeaway: ask which scenario you're in before you start. It determines how much post-processing the raw ASR output needs.

Pre-process the mix before you transcribe

Run a finished broadcast master into a stock ASR and you get artefacts: snare hits read as plosives, synth swells as filler words, dropouts where the voice goes under the bed. The fix is vocal isolation first, transcription second.

Meta's Demucs v4 (Hybrid Transformer Demucs) reports a Signal-to-Distortion Ratio of 9.00 dB on the MUSDB HQ test set, averaged across the separated stems. In practice that is high enough that the isolated vocal stem transcribes cleanly even when the original mix is vocal-light. The isolated track can sound slightly artefacted to a human ear; that's fine, ASR doesn't care.

Modern ASR helps on its own too: current engines hallucinate far less on silent or music-only segments than earlier models did, which matters for finished VO where the voice drops out under a music bed. On clean podcast-grade English our engine lands at 90% word accuracy or better; a dense trailer mix is a different problem, and you should still isolate vocals first.
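A minimal sketch of the isolation step, assuming Demucs is installed and available on the PATH as the `demucs` command. The `--two-stems`, `-n`, and `-o` options come from the Demucs command line; the helper names, the `htdemucs` model choice, and the output folder layout (Demucs' default naming) are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def demucs_command(master: Path, out_dir: Path, model: str = "htdemucs") -> list[str]:
    """Build the Demucs CLI call that splits a mix into vocals and no_vocals."""
    return [
        "demucs",
        "--two-stems=vocals",  # separate vocals from everything else
        "-n", model,           # pretrained model name (assumed default here)
        "-o", str(out_dir),    # root folder for the separated stems
        str(master),
    ]


def vocals_path(master: Path, out_dir: Path, model: str = "htdemucs") -> Path:
    """Where the vocal stem is expected to land, assuming default output naming."""
    return out_dir / model / master.stem / "vocals.wav"


def isolate_vocals(master: Path, out_dir: Path) -> Path:
    """Run Demucs, then return the path of the stem to send to ASR."""
    subprocess.run(demucs_command(master, out_dir), check=True)
    return vocals_path(master, out_dir)
```

The isolated stem keeps the same duration as the input file, so timestamps from the ASR pass still line up with the master.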

What the transcript needs to capture beyond words

A voiceover script is not just a wall of text. Performance metadata matters.

Pauses: anything over ~300 ms should be marked. The translator needs to know there's a beat between "And then —" and "everything changed."
Breath: audible inhales before a phrase signal where the artist reset. Mark as [breath] or as a paragraph break with timing.
Emphasis: which word carries the stress. ASR doesn't give you this directly — you infer it from pitch and level on a second pass.
Non-verbal vocalisations: laughs, sighs, mouth clicks, grunts, efforts. Retained in dubbing and game scripts (they are assets), stripped from clean copy scripts.
Overlaps and ad-libs: anything off-script that made it into the final mix.

Word-level timestamps from the ASR — our engine and most enterprise APIs return them down to 10 ms increments — give you the skeleton. Boundary accuracy is approximate (the model places the timestamp where it thinks the word ends, not always where the waveform agrees), but it's close enough that a snap-to-zero-crossing in your DAW fixes it.
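As a rough sketch of how the pause rule can be applied to that skeleton: the snippet below walks a list of timed words and inserts a marker wherever the gap between one word's end and the next word's start exceeds the threshold. The `Word` class and the `[pause N ms]` notation are hypothetical; adapt them to whatever your ASR JSON and house style use.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio file
    end: float    # seconds from the start of the audio file


def annotate_pauses(words: list[Word], min_pause_ms: int = 300) -> str:
    """Join words into one line, marking gaps longer than min_pause_ms."""
    parts: list[str] = []
    prev_end = None
    for w in words:
        if prev_end is not None:
            gap_ms = round((w.start - prev_end) * 1000)
            if gap_ms > min_pause_ms:
                parts.append(f"[pause {gap_ms} ms]")
        parts.append(w.text)
        prev_end = w.end
    return " ".join(parts)


words = [
    Word("And", 0.00, 0.18),
    Word("then", 0.20, 0.52),
    Word("everything", 1.10, 1.64),
    Word("changed.", 1.66, 2.10),
]
print(annotate_pauses(words))  # And then [pause 580 ms] everything changed.
```

Because boundary accuracy is approximate, treat the measured gap as a flag for review rather than a frame-accurate value.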

For dubbing work specifically, export JSON, not plain text. Feeding raw ASR JSON into a Claude 3 pass (200k-token context window; Claude 3 shipped March 2024 with the JSON adherence that makes this reliable) or a GPT-4 pass lets you reformat into structured cue sheets, with line number, in-time, out-time, source text, and translation slot, without manual data entry.
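The cue-sheet shape itself can be sketched without an LLM. The snippet below is a deterministic first pass, not the LLM step: it groups timed words into lines wherever the gap exceeds a threshold and leaves an empty translation slot. The word-dict keys (`text`, `start`, `end`), the column names, and the 0.3-second split are assumptions, not a fixed schema.

```python
import csv
import io

FIELDS = ["line", "in_s", "out_s", "source_text", "translation"]


def build_cues(words: list[dict], split_gap_s: float = 0.3) -> list[dict]:
    """Group word dicts into cue lines at gaps longer than split_gap_s."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    for w in words:
        if current and w["start"] - current[-1]["end"] > split_gap_s:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return [
        {
            "line": i,
            "in_s": round(g[0]["start"], 3),
            "out_s": round(g[-1]["end"], 3),
            "source_text": " ".join(w["text"] for w in g),
            "translation": "",  # slot for the translator to fill
        }
        for i, g in enumerate(groups, start=1)
    ]


def cues_to_csv(cues: list[dict]) -> str:
    """Serialise cue rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(cues)
    return buf.getvalue()
```

A split on silence alone will not always match sentence or shot boundaries, which is where a language-aware pass or a human still earns its place.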

Timecodes for picture-locked work

Broadcast spots, animation, and video game cinematics are picture-locked: the visuals don't move, the VO has to fit. The transcript has to carry timecodes that match the master video.

We export in three formats depending on the downstream tool:

SRT or VTT if the script is going into a subtitling or captioning workflow.
CSV with in/out timecodes per line if it's going to a translator or dub director.
Pro Tools-compatible text with timecode markers if the client is recutting in their DAW.
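For the subtitle route, here is a small sketch of SRT serialisation. The SRT layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard one; the function names and the `(start, end, text)` tuple shape are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Serialise (start_s, end_s, text) cues as numbered SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```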

The trap: ASR timestamps are relative to the uploaded audio file. If the finished spot has 2 seconds of black/silence at the head, or you're working from a stem that doesn't start at 00:00:00:00, you have to offset everything. Confirm the head-pop frame before export.
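A sketch of the offset step, assuming an integer, non-drop frame rate such as 24 or 25 fps. Drop-frame 29.97 needs different frame counting and is not handled here. The function name and the `offset_s` parameter (the master timecode, in seconds, of the uploaded file's first sample) are hypothetical.

```python
def to_timecode(seconds: float, fps: int = 25, offset_s: float = 0.0) -> str:
    """Convert file-relative seconds to non-drop HH:MM:SS:FF on the master timeline."""
    total_frames = round((seconds + offset_s) * fps)
    if total_frames < 0:
        raise ValueError("offset moves the timestamp before 00:00:00:00")
    frames = total_frames % fps
    total_seconds = total_frames // fps
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}:{frames:02d}"


# A stem whose first sample sits at 01:00:00:00 on the master:
print(to_timecode(2.4, fps=25, offset_s=3600.0))  # 01:00:02:10
```

A negative `offset_s` covers the opposite case, where the uploaded file carries extra head silence that the master timeline does not.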

For animation and games, where the picture may still be in flux, you sometimes want the inverse: a script with line lengths in milliseconds, no absolute timecodes, so the animator can match length without locking to a frame.
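That inverse export is a small transformation: keep each line's duration and drop its position. A minimal sketch, with a hypothetical function name and the same `(start, end, text)` tuple shape assumed:

```python
def line_lengths_ms(cues: list[tuple[float, float, str]]) -> list[tuple[int, int, str]]:
    """Return (line number, duration in ms, text) with no absolute timecodes."""
    return [
        (i, round((end - start) * 1000), text)
        for i, (start, end, text) in enumerate(cues, start=1)
    ]
```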

Multilingual: source-then-translate, not direct multilingual ASR

You have two options for non-English source material, and the right answer is almost always option one.

Option 1: Transcribe in the source language, then translate. Run the Japanese VO through an ASR model that supports Japanese natively. Get a clean Japanese transcript with timecodes. Hand it to a human translator (or a separate LLM pass) with the timecodes attached.

Option 2: Direct multilingual transcribe-and-translate in one pass. Whisper will do this — Japanese audio in, English text out. The output looks plausible. The problem: direct-translation models naturalise the phrasing. They drop filler, compress idioms, and smooth breath pauses to make the English read well. That destroys your timecode-to-syllable mapping, which is the whole reason you're transcribing a VO in the first place.

We default to option 1. The translator needs the source script anyway, for context and client signoff. Skipping that step means you can't audit the translation against the original line.
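With the open-source `whisper` Python package, the difference between the two options comes down to the `task` argument. The sketch below assumes that package is installed; the wrapper names and the flattened word shape are hypothetical, and a hosted ASR API will expose the same choice under different parameter names.

```python
def transcribe_source(audio_path: str, language: str = "ja",
                      model_name: str = "large-v3") -> dict:
    """Option 1: transcribe in the source language with word-level timestamps."""
    import whisper  # imported lazily so flatten_words works without the package

    model = whisper.load_model(model_name)
    # task="transcribe" keeps the source language. task="translate" would
    # emit English directly (option 2) and lose the word-to-timecode mapping.
    return model.transcribe(
        audio_path, language=language, task="transcribe", word_timestamps=True
    )


def flatten_words(result: dict) -> list[dict]:
    """Flatten Whisper's segments[].words[] into one list of timed words."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append({"text": w["word"].strip(), "start": w["start"], "end": w["end"]})
    return words
```

The flattened list is what gets handed to the translator alongside the source-language text.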

A note on model selection for cross-lingual work: Whisper large-v3 (November 2023) moved from an 80-bin to a 128-bin Mel spectrogram input, and OpenAI reports 10–20% fewer errors than large-v2, depending on language. We charge one price across all 99 supported languages.

Cleanup: slate, retakes, director chatter

Working from a finished, mixed master, this section is short — slate and chatter are already gone. Working from raw studio session files (which happens when the master is lost and only the stems survived), it's most of the work.

What to strip, in rough order:

The slate: "Acme spot, take 4, mark." Always at the head.
Talkback from the booth: director and engineer feeding notes. Often on a separate channel — if so, mute that channel before ASR.
Retakes and false starts: pick the take that matches the final mix, keep alternates in an appendix if the client wants them, discard the rest. Don't silently merge a false start into the final sentence.
Brand terms and legal copy: ASR guesses unfamiliar names from sound. Check product names, URLs, drug names, and disclaimers by hand. "Terms apply" and "terms and conditions apply" are different lines when legal sign-off is involved.
Effort sounds: only strip these for clean copy scripts. For games and animation, breaths, grunts, and reactions are deliverables.

For director chatter, diarization helps. Stereo session files with the director on a separate channel are trivial — channel-split gives perfect separation. Mono sessions need a speaker-diarization pass. Our speaker diarization handles 2–4 speakers cleanly and degrades past six. For a typical VO session that's enough.
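The channel-split case can be sketched with the standard library alone. This assumes a 16-bit PCM stereo WAV; which channel carries the talent and which carries talkback is session-specific, so check before discarding either. The function name is hypothetical.

```python
import array
import wave


def split_stereo(src_path: str, left_path: str, right_path: str) -> None:
    """Write each channel of a 16-bit stereo WAV to its own mono WAV file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = src.getframerate()
        samples = array.array("h")  # one 2-byte item per sample
        samples.frombytes(src.readframes(src.getnframes()))
    # Stereo PCM is interleaved L, R, L, R, ... so stride-2 slices pick one channel.
    for path, channel in ((left_path, samples[0::2]), (right_path, samples[1::2])):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(channel.tobytes())
```

Send only the talent channel to ASR; the other file stays available if someone later needs the director's notes.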

Where this fits against existing tools

Otter.ai and Rev are built for meetings and interviews: they assume conversational speech and don't expose the frame-accurate timecodes a dub house needs. Descript handles VO well and lets you edit audio by editing text, which is excellent for podcast work and rough for picture-locked broadcast. Rev's human transcription, which is priced in dollars per minute rather than cents, is overkill if you're doing 50 spots and want a programmatic pipeline; API ASR at $0.0061/minute (AssemblyAI Standard) or $0.0043/minute (Deepgram Nova-2, claimed 8.4% WER on general domains) is the order of magnitude that makes batch reverse-engineering economic.

We sit in the API-priced ASR bracket with word-level timestamps, 99-language support at one price, and JSON output that plays nicely with a downstream LLM cleanup step. See our audio-to-text pipeline for the full I/O spec, or the podcast transcription use case for a long-form example that shares most of the same plumbing.

What this can't do

Four honest limits:

Performance direction is inference. ASR gives you words and timing. Emphasis, sub-text, and intent are still a human pass.
Timestamp boundaries are approximate. 10 ms granularity in the JSON does not mean 10 ms accuracy at the waveform. Expect to nudge in your DAW.
We don't certify the output as the legal production script. If a spot has medical claims, financial copy, or broadcast clearance language, a human reviewer has to verify the extracted text against the approved source.
We are not a HIPAA BAA-covered product yet. Irrelevant for most VO work, relevant if the audio contains protected health information — medical narration, patient-education spots. Email us if that's the use case.

What next

Try the Free plan (30 minutes/month) with one finished spot you've lost the script for. Export JSON, not just text — you'll need the timecodes.
If the master is music-heavy, run Demucs v4 vocal isolation before ASR and realign timecodes against the master after.
For dubbing prep, pipe the JSON into a Claude 3 or GPT-4 reformatting pass to get a translator-ready cue sheet with in/out per line and a syllable budget per slot.
For agencies running batch jobs, the Pro plan at 600 audio-minutes/month (as of May 2026) covers roughly 20 finished spots with headroom for retake comparisons.
