Telehealth visit transcription for Zoom, Doxy.me, and Healthie

Zoom for Healthcare, Doxy.me, and Healthie each handle recordings differently. How to transcribe telehealth visits into chart notes without breaking HIPAA flow.

The short answer: you can transcribe a telehealth visit cleanly from all three platforms, but the file flow differs for each. Zoom for Healthcare hands you a cloud MP4 with a separate M4A audio track. Doxy.me requires you to record locally, either in the browser or with a screen recorder. Healthie attaches the recording to the appointment object, which you can pull via API. Diarization works on mono telehealth audio for two speakers (clinician + patient), and the transcript pastes into most EHR chart notes as plain text.

What we will not promise: we are not a HIPAA BAA-covered product yet. If your compliance officer requires a signed BAA before any PHI leaves your EHR, stop here and email us — we are piloting. Otter on its Enterprise tier and dedicated medical scribes like Dragon Medical One are built around signed BAA coverage today; if that is non-negotiable this quarter, those are the honest comparisons. Everything below assumes you have either de-identified the audio, gotten patient consent for a non-BAA transcription vendor, or you are evaluating workflow before signing paperwork.

Where each platform stores the recording

The three platforms diverge at recording time, which is the first decision that affects your chart-note flow.

Zoom for Healthcare records to Zoom Cloud (if your admin enabled cloud recording) or to the host's local disk. Cloud recordings produce an MP4 (video+audio), an M4A (audio-only), and a VTT auto-caption file if Zoom's captions ran. The M4A is what you want: it is smaller, and the transcription engine doesn't care about the video frame. Files live in Zoom Cloud for 30, 60, or 120 days, or indefinitely, depending on plan settings. Zoom signs a BAA on Zoom for Healthcare accounts.

Doxy.me does not record server-side at all — the platform is peer-to-peer by design, which is part of how they argue their compliance posture. The free and Professional tiers offer no recording. The Clinic and Enterprise tiers include a browser-based Session Recording add-on that writes to the clinician's local disk as WebM. No cloud storage, no API. You leave the visit with a WebM on your laptop and decide what to do with it — and if you forgot to press record before the patient joined, the audio is gone.

Healthie stores telehealth recordings as attachments on the appointment record when you enable Record Telehealth Session in practice settings. The file is an MP4 retrievable through the Healthie GraphQL API (appointment.zoom_appointment.recording_urls if you are on Zoom-backed telehealth, or the native recording field on newer accounts). Because Healthie's video is typically Zoom under the hood, the same visit can leave artifacts in both systems — audit both retention policies.

Takeaway: Zoom and Healthie give you a cloud URL or API handle. Doxy.me gives you a local file or nothing.

Getting the file out

Zoom for Healthcare

Three options, in order of operational simplicity:

  1. Manual download. Host signs in to Zoom web, navigates to Recordings, downloads the M4A, uploads it to the transcription tool. Two minutes per visit. Fine for under five visits a day.
  2. Cloud Recording webhook. Zoom fires a recording.completed event that carries a download_url for each recording file and a short-lived download_token. Your backend grabs the audio, sends it to transcription, and writes the result wherever you want. This is the path once you cross a handful of visits a day.
  3. Recall.ai meeting bot joining the call as a participant. We use Recall.ai under the hood for our meeting-bot transcription — the bot appears in the participant list under a name you configure, posts a two-party consent disclosure in chat on join, and the audio never touches Zoom Cloud. The honest caveat: a bot in a patient visit can be jarring, the chat disclosure is mechanical, and many clinicians prefer the webhook path for that reason. Bots make more sense for staff meetings and case conferences than for first-time patient visits.
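
For the webhook path (option 2 above), the receiving side can be sketched roughly as follows. This is a minimal illustration, not Zoom's reference handler: the payload shape (payload.object.recording_files with file_type and download_url fields) and the bearer-token header are assumptions to verify against Zoom's current webhook documentation, and the helper names are hypothetical.

```python
from typing import Optional
from urllib.request import Request


def pick_audio_url(event: dict) -> Optional[str]:
    """Return the download URL of the M4A file in a recording.completed event."""
    if event.get("event") != "recording.completed":
        return None
    # Assumed payload shape: payload.object.recording_files is a list of dicts.
    files = event.get("payload", {}).get("object", {}).get("recording_files", [])
    for item in files:
        if item.get("file_type") == "M4A":
            return item.get("download_url")
    return None


def build_download_request(url: str, token: str) -> Request:
    """Attach the short-lived token as a bearer header (assumed auth scheme)."""
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Hypothetical event body, trimmed to the fields used above.
sample_event = {
    "event": "recording.completed",
    "payload": {
        "object": {
            "recording_files": [
                {"file_type": "MP4", "download_url": "https://example.invalid/rec/video.mp4"},
                {"file_type": "M4A", "download_url": "https://example.invalid/rec/audio.m4a"},
            ]
        }
    },
}
```

Passing the request to urllib.request.urlopen would fetch the audio. The token is taken as a parameter so the sketch does not depend on where your payload carries it.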

Doxy.me

One real option: the local WebM. Upload it via the web UI, or set up a watch folder on your desktop that uploads anything added to ~/Doxy Recordings/. There is no recording API.
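
A watch folder can be as simple as a polling loop. The sketch below is one way to do it under stated assumptions: the upload callable is a placeholder for whatever client you use, and the "file has not been modified for a few seconds" check is a heuristic for "the browser has finished writing", not a guarantee.

```python
import time
from pathlib import Path
from typing import Callable, List, Set


def scan_once(folder: Path, seen: Set[Path], min_age_s: float = 5.0) -> List[Path]:
    """Return WebM files not yet seen whose last write is at least min_age_s old."""
    ready = []
    now = time.time()
    for path in sorted(folder.glob("*.webm")):
        if path in seen:
            continue
        if now - path.stat().st_mtime < min_age_s:
            continue  # probably still being written
        seen.add(path)
        ready.append(path)
    return ready


def watch(folder: Path, upload: Callable[[Path], None], poll_s: float = 10.0) -> None:
    """Poll the folder forever and hand each finished recording to upload()."""
    seen: Set[Path] = set()
    while True:
        for path in scan_once(folder, seen):
            upload(path)
        time.sleep(poll_s)
```

Calling watch(Path.home() / "Doxy Recordings", upload=my_uploader) would run it, where my_uploader is your own function. The seen set lives in memory, so a restart re-uploads existing files unless you move or delete them after upload.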

Without the Clinic-tier add-on, the alternative is a screen recorder capturing system audio (OBS on Windows; QuickTime on macOS, which needs a loopback audio driver before it can capture system audio). Check your state's recording laws and document consent before doing this.

Healthie

Pull the recording URL via the GraphQL API after the appointment closes. The endpoint returns a signed S3-style URL good for a limited window. Fetch the MP4, extract the audio (for example, ffmpeg -i visit.mp4 -vn -acodec copy visit.m4a), and send it to transcription. The appointment.completed webhook is the natural trigger.
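
The two mechanical steps here, asking for the recording URL and stripping the video, can be sketched as below. The field path comes from the description above; the exact query shape, the variable type, and the function names are assumptions, so check them against Healthie's schema before relying on them.

```python
import json
from typing import List

# Field path taken from the text above; the surrounding query shape is assumed.
RECORDING_QUERY = """
query Recording($id: ID) {
  appointment(id: $id) {
    zoom_appointment { recording_urls }
  }
}
"""


def build_graphql_body(appointment_id: str) -> str:
    """Serialize the JSON body for a GraphQL POST request."""
    return json.dumps({"query": RECORDING_QUERY, "variables": {"id": appointment_id}})


def build_extract_cmd(src: str, dst: str) -> List[str]:
    """ffmpeg arguments: drop the video (-vn), copy the audio stream as-is.

    Stream copy avoids re-encoding, so dst needs a container that can hold
    the source codec (for AAC audio, .m4a is the usual choice).
    """
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "copy", dst]
```

Running subprocess.run(build_extract_cmd("visit.mp4", "visit.m4a"), check=True) would perform the extraction, assuming ffmpeg is on the PATH.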

Diarization on mono telehealth audio

Telehealth recordings are almost always mono — Zoom downmixes participants into a single channel on cloud recording, Doxy.me's browser recorder captures a mono mix, and Healthie inherits whatever Zoom produced. This matters because the cleanest possible diarization is channel-split stereo (left = clinician, right = patient), which gives perfect attribution. You do not get that on telehealth out of the box. The only way to land stereo is a local recording setup that routes your mic to one channel and the remote audio to the other — possible with OBS or a virtual audio cable, not the default.
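
If you do build a local stereo setup, separating the channels is straightforward. The sketch below assumes an uncompressed PCM WAV file with the clinician on the left channel and the patient on the right; the function name is illustrative, and it uses only the standard library.

```python
import wave


def split_stereo(src: str, left_out: str, right_out: str) -> None:
    """Write the left and right channels of a stereo PCM WAV as two mono files."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a 2-channel file")
        width = wav.getsampwidth()  # bytes per sample, per channel
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    # Stereo PCM interleaves samples: L0 R0 L1 R1 ...
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_out, left), (right_out, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

Each mono file can then be transcribed separately and the speaker is known from the file, with no diarization step involved.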

What we run on mono telehealth audio: pyannote-3.1. Good for 2 speakers (clinician + patient), still good at 3 (add a parent or interpreter), degrades past 6. For a typical solo visit, expect clean alternating turns labeled Speaker A and Speaker B — you map them once in the UI and the transcript renames everywhere.

Edge cases that fail:

  • Patient and clinician talk over each other for more than 2-3 seconds. The diarizer picks one.
  • Patient is on a phone bridge while the clinician is on broadband. The phone audio drops to 8 kHz and word error rate climbs to roughly 17.7% on that speaker's segments — versus ~7.88% WER on 16 kHz clean podcast English on our primary engine, AssemblyAI Universal-3. Telehealth sits between those two cases depending on the patient's mic and connection.
  • A third voice enters briefly (a nurse, a partner off-camera) and gets labeled as the clinician because their turns are short.
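
To put the WER figures above in context: word error rate is the word-level edit distance between a reference transcript and the engine's output, divided by the reference length. The sketch below is a plain implementation for checking your own recordings. It is not how the quoted benchmark numbers were produced, and real evaluations normally normalize punctuation and casing more carefully.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four is a 25% WER on this sentence.
print(wer("no chest pain today", "chest pain today"))  # 0.25
```

The example also shows why an aggregate WER can mislead in a clinical setting: a single dropped "no" is a small error by the metric and a large one in the chart.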

Our model is general-purpose, not a specialized medical ASR. It handles common medications, anatomy, and standard psychiatric vocabulary well; it struggles with rare drug names, complex dosages, and laterality when audio is muffled. Medication names, allergies, negations ("no chest pain" vs "chest pain"), and dates are exactly the tokens you have to read carefully before signing anything.
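
One way to make that careful read faster is to flag the risky tokens before review. The sketch below is a hypothetical helper, not a feature of the product: the patterns are deliberately small illustrations and will miss plenty (spelled-out doses, drug names, written dates), so treat it as a reading aid rather than a safety check.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; extend them for your own specialty.
PATTERNS = {
    "negation": re.compile(r"\b(no|not|denies|denied|without|never)\b", re.I),
    "dosage": re.compile(r"\b\d+(\.\d+)?\s?(mg|mcg|ml|g|units?)\b", re.I),
    "laterality": re.compile(r"\b(left|right|bilateral)\b", re.I),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}(/\d{2,4})?\b"),
}


def flag_for_review(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs, grouped by pattern in the order above."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

For the line "No chest pain. Takes 50 mg sertraline, left knee." this returns a negation hit ("No"), a dosage hit ("50 mg"), and a laterality hit ("left"), which are the spots to compare against the audio.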

Transcript, draft, signed note — keep them separate

This is where most clinicians lose time, so we will be specific.

The transcript is what the audio engine heard. The chart note is the clinical record. Do not collapse the two. The workflow we see survive an audit:

  1. Transcribe the recording. Output is text with speaker labels and timestamps.
  2. Run a SOAP-format prompt over the transcript with whatever LLM your practice has approved — we do not generate SOAP notes ourselves, and we do not ship clinical summarization. You wire that step, or you write the note manually.
  3. Paste the draft into the EHR. Review against the transcript. Edit. Sign.
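
Steps 1 and 2 can be wired together in a few lines. The sketch below assumes a transcript delivered as a list of segments with speaker, start (seconds), and text keys. That structure, the function names, and the prompt wording are all illustrative, so adapt them to the JSON your export actually contains and to the LLM your practice has approved.

```python
from typing import Dict, List


def to_plain_text(segments: List[dict], names: Dict[str, str]) -> str:
    """Render segments as '[MM:SS] Speaker: text' lines, applying the name map."""
    lines = []
    for seg in segments:
        speaker = names.get(seg["speaker"], seg["speaker"])
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {seg['text']}")
    return "\n".join(lines)


def build_soap_prompt(transcript: str) -> str:
    """Wrap the transcript in a SOAP drafting instruction (wording is an example)."""
    return (
        "Draft a SOAP note from the visit transcript below. "
        "Use only what was said; mark anything uncertain for clinician review.\n\n"
        "Subjective:\nObjective:\nAssessment:\nPlan:\n\n"
        f"Transcript:\n{transcript}"
    )


segments = [
    {"speaker": "A", "start": 0.0, "text": "How are you sleeping?"},
    {"speaker": "B", "start": 65.4, "text": "Better this week."},
]
text = to_plain_text(segments, {"A": "Clinician", "B": "Patient"})
```

Keeping the timestamps in the plain-text form makes the review in step 3 quicker, because each line of the draft can be traced back to a point in the recording.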

For EHRs that strip formatting (most): copy as plain text. Headings collapse to lines like "Subjective:" which is what you want. Healthie's chart-note field renders markdown on the patient-facing side; internally it stores raw text.

Exports we ship: TXT, DOCX, SRT, VTT, PDF, JSON. TXT is the right choice 90% of the time for chart pasting. DOCX is for when a supervising clinician wants to track-changes the draft before signing.

What we don't ship: direct write-back into Healthie, Epic, Cerner, or SimplePractice. You paste. If you want to automate it, the transcript and any LLM-generated draft are available via webhook and you can wire them into your EHR's API — but we are not the integration layer.

A typical 45-minute therapy session generates around 6,000 words of raw transcript. Almost no one wants that in the chart. The transcript is reference material; the signed note is the record.

HIPAA chain of custody

The honest version, because this is the question your compliance officer will ask.

Custody hops for a Zoom Cloud → our transcription → EHR flow:

  1. Audio is created in the browser on clinician and patient devices.
  2. Zoom Cloud stores the recording. Zoom signs a BAA on Zoom for Healthcare.
  3. Your backend (or a manual download) fetches the file. PHI is briefly in transit.
  4. Transcription engine processes the audio. PHI sits at rest with the transcription vendor for the duration of processing and any retention window.
  5. Transcript returns to your system. You paste into the EHR.

Step 4 is the one that matters. We handle data with HIPAA-grade practices at rest — AES-256 encryption, access controls, audit logs — but we do not currently sign BAAs. Per HHS HIPAA guidance, a vendor that creates, receives, maintains, or transmits PHI for a covered entity is generally a business associate, so the BAA is the contractual gate, not the technical posture. AssemblyAI does sign BAAs directly with healthcare customers on their own platform; going to them, or to a transcription vendor with signed BAA coverage, is the right move today if BAA is required.

What we recommend for clinicians evaluating us right now:

  • De-identified test recordings only — a colleague playing a patient role, not real PHI.
  • If you go to production with real PHI before we have BAA coverage, document patient consent for transcription by a non-BAA vendor. Some practices already do this for human scribes and the language is portable.
  • Delete audio from our system after the transcript is finalized. The delete endpoint is in the API and the UI; our opt-out endpoint at /opt-out/{token} ensures audio is never used in future model work.

Consent and state recording law

HIPAA is not the only rule. State two-party consent laws and professional licensing boards may require explicit recording consent on every visit. A reusable script, said at the top of the call:

"Before we start, I'd like to record the audio of this visit to help prepare your chart note. The recording will be used for documentation and handled under our privacy policy. Are you okay with that?"

Document the verbal yes in the chart. Our meeting bot posts a disclosure in chat on join, but chat disclosure alone does not satisfy clinical or state-law consent requirements — your practice still owns that process.

For the broader picture beyond telehealth — dictation, in-person visits, SOAP from voice memos — our page for clinicians covers the full set.

A 30-minute trial workflow

If you want to test fit before any procurement conversation, do this today with one recording:

  1. Pick a non-PHI recording — a continuing-education webinar, a mock visit with a colleague, any two-person audio of 10-30 minutes.
  2. Sign up for the Free plan (30 audio-minutes/month, exports unlocked, as of May 2026). Upload.
  3. Run transcription. Check speaker labels. Rename Speaker A and Speaker B.
  4. Copy as plain text. Paste into a blank chart-note template in your EHR sandbox.
  5. Time yourself editing it into a finished SOAP note. Compare against your current dictation or post-visit writing time.

That is the evaluation. If the edit time beats your current note-writing time and the diarization survives your speech patterns, the rest is plumbing — webhook from Zoom, API pull from Healthie, watch-folder from Doxy.me.

What next

  • Try the audio-to-text pipeline with one non-PHI recording on the Free plan to gauge accuracy on your actual audio conditions.
  • If you run more than 20 visits a week, read the meeting-bot path in our meeting-notes feature page — Recall.ai joining the Zoom call avoids the Zoom Cloud retention question, with the caveat that bots can feel intrusive in patient visits.
  • If signed BAA coverage is a hard requirement, email us before piloting. We will tell you honestly where we are and whether AssemblyAI's direct enterprise offering is a better fit this quarter.
  • Compare plans on the pricing page: Pro at 600 audio-minutes/month covers 10 hour-long visits (or roughly 40 fifteen-minute check-ins), and Business at 5000 minutes covers a small group practice.
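
To size a plan against your own schedule, divide the monthly audio minutes by your typical visit length. A minimal sketch, where the visit lengths are example inputs and the plan sizes are the ones quoted above:

```python
def visits_covered(plan_minutes: int, visit_minutes: int) -> int:
    """Whole visits that fit in a monthly audio-minute allowance."""
    if visit_minutes <= 0:
        raise ValueError("visit_minutes must be positive")
    return plan_minutes // visit_minutes


print(visits_covered(600, 60))   # 10 hour-long visits on 600 minutes
print(visits_covered(600, 45))   # 13 therapy-length sessions
print(visits_covered(600, 15))   # 40 short check-ins
print(visits_covered(5000, 60))  # 83 hour-long visits on 5000 minutes
```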