YouTube's auto-captions miss scripture references, theological terms, and speaker turns. Here's a 15-minute workflow: download a YouTube sermon's audio, transcribe it with Whisper, and end up with a clean, publishable transcript.
Words like 'propitiation', 'eschatology', 'pneumatology', or 'Septuagint' rarely come through correctly. YouTube wasn't trained for theological speech.
YouTube outputs a single wall of caption text with rough punctuation. For a blog post or research notes, you need clean paragraphs — which is what Whisper produces.
You can scrape captions via a transcript panel, but you get raw text with no SRT/VTT/JSON export options. Sermon Transcription gives you all formats with one click.
Total time: 15 minutes for a 45-minute sermon. Total cost: $0.27.
Copy the full URL from the address bar: youtube.com/watch?v=XXXXX or youtu.be/XXXXX. For your own channel, you can also download directly from YouTube Studio (Content → ... menu → Download).
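If you handle many links, it can help to confirm that a pasted URL is one of the two forms above before downloading. The following is a minimal sketch using only Python's standard library; `extract_video_id` is a hypothetical helper name, and it deliberately covers only the youtube.com/watch and youtu.be forms mentioned here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a youtube.com/watch or youtu.be URL, else None."""
    # Tolerate links pasted without a scheme, e.g. "youtu.be/XXXXX".
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Treat www. and m. hosts the same as the bare domain.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtube.com" and parsed.path == "/watch":
        # The ID is the "v" query parameter; other parameters are ignored.
        return parse_qs(parsed.query).get("v", [None])[0]
    if host == "youtu.be":
        # The ID is the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    return None


if __name__ == "__main__":
    for link in ("https://www.youtube.com/watch?v=XXXXX", "youtu.be/XXXXX"):
        print(link, "->", extract_video_id(link))
```

A return value of `None` means the link is not in either form and is worth re-copying from the address bar.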
Option A (recommended): yt-dlp. Install it with `brew install yt-dlp` or `pip install yt-dlp` (the audio extraction step also needs ffmpeg installed). Then run: `yt-dlp -x --audio-format mp3 --audio-quality 5 URL`. Quality 5 is a variable-bitrate setting, so the exact size depends on the recording; at roughly 96kbps the file stays under our 25MB cap for sermons up to about 35 minutes.
Option B: Browser extension. Free Video Downloader (Firefox) or a similar extension grabs the file while you stream. Look for the MP4 option, then convert to MP3 with Audacity.
Option C: Screen record + extract audio — record the video playback with QuickTime (Mac), Game Bar (Windows), or OBS. Export audio with ffmpeg or Audacity.
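For Option A, the command and the duration estimate can be sketched in Python. This is an illustration under stated assumptions, not part of the service: `build_ytdlp_args` and `max_minutes_under_cap` are hypothetical helper names, the flags are the ones quoted in Option A, and the arithmetic assumes a constant bitrate and 1MB = 1,000,000 bytes. Because quality 5 is variable-bitrate, real files may come out larger or smaller.

```python
from typing import List


def build_ytdlp_args(url: str) -> List[str]:
    """Build the Option A command as an argument list (nothing is run here)."""
    return ["yt-dlp", "-x", "--audio-format", "mp3", "--audio-quality", "5", url]


def max_minutes_under_cap(bitrate_kbps: float, cap_mb: float = 25.0) -> float:
    """Longest recording, in minutes, that fits under the cap at a constant bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kilobits per second -> bytes per second
    cap_bytes = cap_mb * 1_000_000              # assumption: decimal megabytes
    return cap_bytes / bytes_per_second / 60


if __name__ == "__main__":
    print(" ".join(build_ytdlp_args("https://youtu.be/XXXXX")))
    print(f"{max_minutes_under_cap(96):.1f} minutes fit under 25MB at 96kbps")
```

To actually run the download, the list can be passed to `subprocess.run(args, check=True)`. At 96kbps the estimate works out to just under 35 minutes, so a 45-minute sermon will usually need the lower-bitrate re-export described next.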
Check the resulting file size. If it's over 25MB, re-export at a lower bitrate. Quick ffmpeg recipe: `ffmpeg -i input.mp3 -b:a 64k -ac 1 output.mp3`. A 64kbps mono file is more than adequate for transcription, and at that bitrate a 45-minute sermon comes to roughly 22MB.
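The size check and re-export can be scripted along these lines. Treat it as a sketch: the helper names are hypothetical, the cap is assumed to be counted in decimal megabytes, and the size estimate assumes a constant bitrate.

```python
import os
from typing import List

CAP_BYTES = 25 * 1_000_000  # assumption: the 25MB cap counted in decimal megabytes


def is_over_cap(path: str, cap_bytes: int = CAP_BYTES) -> bool:
    """True if the file on disk is larger than the upload cap."""
    return os.path.getsize(path) > cap_bytes


def build_ffmpeg_args(src: str, dst: str, bitrate_kbps: int = 64) -> List[str]:
    """Build the re-export command from the recipe above: fixed bitrate, mono."""
    return ["ffmpeg", "-i", src, "-b:a", f"{bitrate_kbps}k", "-ac", "1", dst]


def estimated_size_mb(minutes: float, bitrate_kbps: float) -> float:
    """Approximate output size in MB for a constant-bitrate file."""
    return bitrate_kbps * 1000 / 8 * minutes * 60 / 1_000_000


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_args("input.mp3", "output.mp3")))
    print(f"45 min at 64kbps is about {estimated_size_mb(45, 64):.1f} MB")
```

A typical use is to call `is_over_cap("input.mp3")` first and only pass the ffmpeg argument list to `subprocess.run(args, check=True)` when it returns `True`.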
Drag and drop the MP3. Whisper auto-detects the language. The Standard tier is $0.006/min, so a 45-minute sermon costs $0.27. Premium at $0.02/min adds speaker ID for videos with multiple speakers (panel, interview, Q&A).
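The per-sermon cost is simply the per-minute rate times the length. Here is a small sketch using the two rates quoted above; `sermon_cost` is a hypothetical helper, and rounding to the nearest cent is an assumption about billing rather than something the pricing states.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-minute rates quoted in the text.
RATES = {
    "standard": Decimal("0.006"),
    "premium": Decimal("0.02"),
}


def sermon_cost(minutes: float, tier: str = "standard") -> Decimal:
    """Cost in dollars for one recording, rounded to the nearest cent (assumed)."""
    cost = RATES[tier] * Decimal(str(minutes))
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    for tier in RATES:
        print(f"45-minute sermon, {tier}: ${sermon_cost(45, tier)}")
```

`Decimal` is used instead of floats so that 45 × 0.006 comes out as exactly 0.27 rather than a binary approximation.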
Within 3-5 minutes you'll have TXT, SRT, VTT, and JSON exports. The TXT version has proper sentence punctuation, paragraph breaks, and accurate scripture references — far cleaner than YouTube's auto-captions.
In YouTube Studio: Subtitles → Add language → English → Upload file, then choose With timing for the SRT (Without timing is for a plain-text transcript). Publish. Your video now has accurate captions, which can help YouTube search visibility and viewer retention.
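If you edit the SRT by hand before uploading, a quick sanity check can catch broken timing lines. The sketch below is not an official validator and is not part of YouTube or this service: `check_srt` is a hypothetical helper that only checks the common SRT cue layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text) and that start times do not go backwards.

```python
import re
from typing import List

# One SRT timing line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d{2,}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2,}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert timestamp parts to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def check_srt(text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    previous_start = -1
    normalized = text.replace("\r\n", "\n").strip()
    # Cues are separated by blank lines.
    cues = [c for c in re.split(r"\n\s*\n", normalized) if c.strip()]
    for number, cue in enumerate(cues, start=1):
        lines = cue.splitlines()
        if len(lines) < 3:
            problems.append(f"cue {number}: expected index, timing line, and text")
            continue
        match = TIMING.fullmatch(lines[1].strip())
        if match is None:
            problems.append(f"cue {number}: malformed timing line {lines[1]!r}")
            continue
        parts = match.groups()
        start, end = _to_ms(*parts[:4]), _to_ms(*parts[4:])
        if end <= start:
            problems.append(f"cue {number}: end time is not after start time")
        if start < previous_start:
            problems.append(f"cue {number}: starts before the previous cue")
        previous_start = start
    return problems


if __name__ == "__main__":
    sample = "1\n00:00:01,000 --> 00:00:04,000\nTurn to Romans 3:25.\n"
    print(check_srt(sample) or "no problems found")
```

A common slip this catches is pasting WebVTT-style timestamps (with a dot before the milliseconds) into an SRT file.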
For one 45-minute YouTube sermon, here's what you'd pay across the popular options.
| Method | Cost / sermon | Turnaround | Output quality |
|---|---|---|---|
| YouTube auto-captions | Free | Minutes after upload | ~75-85% accuracy, no formatting |
| Sermon Transcription (this site) | $0.27 | 3-5 minutes | 95-98% accuracy, paragraph-clean |
| Otter.ai (Business) | $20-30/mo subscription | Real-time | ~90% accuracy, monthly minute cap |
| Rev.com (AI) | $11.25 (at $0.25/min) | Same day | ~95% accuracy |
| Rev.com (Human) | $67.50 (at $1.50/min) | 12-24 hours | 99%+ accuracy |
For sermon use cases (where 95-98% machine accuracy + 10 min of editing is functionally identical to human work), Whisper-based transcription is the obvious winner.
Three steps, 15 minutes, $0.27.
Download, upload, get a transcript in 15 minutes. First 10 minutes free.