How much does the OpenAI Whisper API cost?

$0.006 per minute of input audio. A 45-minute sermon costs $0.27. A church that transcribes weekly spends about $14/year. There are no monthly minimums and no setup fees.

What's the file size limit for the Whisper API?

25 MB per request. Most 45-minute sermons exceed this at standard MP3 quality. Compress to 64 kbps mono with FFmpeg or split the file into 10-minute chunks before sending.

Does Whisper support speaker diarization?

No — Whisper does not natively label different speakers. For diarization, pair Whisper with pyannote.audio or use ElevenLabs Audio Intelligence, which handles diarization natively.

Can I use Whisper for live sermon captioning?

Not directly — the Whisper API is batch-only with no streaming endpoint. For live captions, use Deepgram, Google Cloud Speech Streaming, or AssemblyAI's WebSocket API.

Technical · 14 min

OpenAI Whisper API for Church Tech Teams: A Developer's Guide

Technical guide for church developers using OpenAI's Whisper API for sermon transcription: setup, code samples, accuracy benchmarks, costs, and when to use Whisper vs ElevenLabs vs a managed service.

Updated May 2026

This guide is for developers, IT volunteers, and church tech directors who want to understand exactly how OpenAI's Whisper API works under the hood, and who need to decide whether to build a transcription pipeline themselves or use a managed service.

What Whisper Is

Whisper is OpenAI's open-source automatic speech recognition (ASR) model, released in September 2022 and updated through several generations since. The latest API version achieves near-human accuracy on clean English audio and can also transcribe many other languages; the underlying model was trained on audio in nearly 100 languages, with accuracy that varies by language.

The Whisper API at api.openai.com costs $0.006 per minute of input audio with no minimums. A 45-minute sermon costs exactly $0.27 to transcribe.
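The arithmetic behind those figures is simple enough to script when budgeting. The sketch below assumes the $0.006 per minute rate quoted above still applies and that billing is close to linear in audio length; the function name is illustrative.

```python
# Rate quoted in this guide; confirm against OpenAI's pricing page before budgeting.
WHISPER_USD_PER_MINUTE = 0.006


def transcription_cost(minutes: float, rate: float = WHISPER_USD_PER_MINUTE) -> float:
    """Return the approximate API cost in USD for `minutes` of audio, rounded to cents."""
    return round(minutes * rate, 2)


per_sermon = transcription_cost(45)
per_year = round(per_sermon * 52, 2)  # one sermon per week

print(f"One 45-minute sermon: ${per_sermon:.2f}")  # $0.27
print(f"52 weekly sermons:    ${per_year:.2f}")    # $14.04
```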

API Setup in Under 5 Minutes

Step 1: Get an API Key

Create an account at platform.openai.com. Add billing. Generate an API key under Settings → API Keys.

Step 2: Install the SDK

npm install openai
# or
pip install openai

Step 3: Transcribe a File (Node.js)

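// Run as an ES module (e.g. sermon.mjs) so import and top-level await work.
import OpenAI from "openai";
import fs from "fs";

// Read the key from the environment rather than hard-coding it.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream("sermon.mp3"),
  model: "whisper-1",
  response_format: "verbose_json", // includes per-segment timestamps
  timestamp_granularities: ["segment"],
});

console.log(transcript.text);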

Step 3 (alt): Transcribe a File (Python)

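from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Open the audio in binary mode; the SDK uploads it as multipart form data.
with open("sermon.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-1",
        response_format="verbose_json",  # includes per-segment timestamps
        timestamp_granularities=["segment"],
    )

print(transcript.text)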

That's the entire integration. About a dozen lines of code for production-grade transcription.

Response Formats

Whisper supports five response formats:

text — plain string. Cleanest for blog publishing.
json — a JSON object containing only the transcript text.
verbose_json — text + language + duration + segments with timestamps. Best for SRT generation.
srt — pre-formatted SRT file. Use directly for YouTube uploads.
vtt — WebVTT format. Use directly for HTML5 video players.

For sermon archives, request verbose_json once and post-process it into plain text, SRT, and VTT yourself.
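That post-processing step can be sketched in a few lines of Python. The sketch assumes each segment has been converted to a plain dict with `start` and `end` in seconds plus `text`, which mirrors the segment fields in a verbose_json response; the function names are illustrative, not part of the OpenAI SDK.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build an SRT document: numbered cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)


def segments_to_vtt(segments: list[dict]) -> str:
    """Build a WebVTT document: a WEBVTT header, then cues using '.' before milliseconds."""
    cues = ["WEBVTT\n"]
    for seg in segments:
        start = srt_timestamp(seg["start"]).replace(",", ".")
        end = srt_timestamp(seg["end"]).replace(",", ".")
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)


def segments_to_text(segments: list[dict]) -> str:
    """Join segment texts into one plain-text transcript."""
    return " ".join(seg["text"].strip() for seg in segments)


# Hand-written sample segments, not real API output.
example_segments = [
    {"start": 0.0, "end": 4.2, "text": " Good morning, church."},
    {"start": 4.2, "end": 9.0, "text": " Open your Bibles to Habakkuk."},
]
print(segments_to_srt(example_segments))
```

Storing the raw verbose_json alongside the generated files means the subtitle formatting can be changed later without paying for a second transcription.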

Handling Large Files (>25 MB)

The Whisper API has a 25 MB file size limit per request. A 45-minute sermon at 128 kbps MP3 is roughly 43 MB — over the limit.

Strategy 1: Compress First

Re-encode to 64 kbps mono. This drops file size to ~22 MB while preserving transcription accuracy (Whisper internally downsamples to 16 kHz anyway).

ffmpeg -i sermon.mp3 -b:a 64k -ac 1 sermon-compressed.mp3
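To check whether a given recording will fit before uploading, the size of constant-bitrate audio can be estimated from its length and bitrate. This is a rough sketch: it assumes constant bitrate, ignores container overhead and ID3 tags, and treats the 25 MB limit as 25 × 10^6 bytes, which may not match exactly how the API measures it.

```python
LIMIT_MB = 25  # per-request limit quoted in this guide


def audio_size_mb(minutes: float, bitrate_kbps: int) -> float:
    """Approximate size of constant-bitrate audio in megabytes (10^6 bytes)."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


for kbps in (128, 64):
    size = audio_size_mb(45, kbps)
    verdict = "over" if size > LIMIT_MB else "under"
    print(f"45 min at {kbps} kbps: {size:.1f} MB ({verdict} the {LIMIT_MB} MB limit)")
```

By the same estimate, a 64 kbps file crosses 25 MB at roughly 52 minutes, so longer services still need splitting or a lower bitrate.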

Strategy 2: Split and Stitch

Use FFmpeg to split a long audio file into 10-minute chunks, transcribe each, and concatenate the results.

ffmpeg -i sermon.mp3 -f segment -segment_time 600 -c copy chunk-%03d.mp3

Then transcribe each chunk and join the results, adjusting timestamps for each chunk's offset.
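The offset adjustment is where stitching usually goes wrong, so here is a minimal sketch of it. It assumes each chunk was transcribed with verbose_json and reduced to a `(duration_seconds, segments)` pair, with segments as plain dicts; the names are illustrative. Using each chunk's real duration rather than a fixed 600 seconds matters because `-c copy` cuts on frame boundaries, so chunks are only approximately 10 minutes long.

```python
def stitch_chunks(chunks: list[tuple[float, list[dict]]]) -> list[dict]:
    """Merge per-chunk segments into one timeline.

    Each item in `chunks` is (duration_seconds, segments), in playback order.
    Segment times are relative to their own chunk, so every segment is shifted
    by the combined duration of all chunks that came before it.
    """
    merged = []
    offset = 0.0
    for duration, segments in chunks:
        for seg in segments:
            merged.append(
                {
                    "start": seg["start"] + offset,
                    "end": seg["end"] + offset,
                    "text": seg["text"],
                }
            )
        offset += duration
    return merged


# Hand-written sample data, not real API output.
chunks = [
    (600.0, [{"start": 0.0, "end": 5.0, "text": "Good morning."}]),
    (598.5, [{"start": 1.0, "end": 3.0, "text": "Turn to chapter two."}]),
    (120.0, [{"start": 0.5, "end": 2.0, "text": "Let us pray."}]),
]
for seg in stitch_chunks(chunks):
    print(f"{seg['start']:8.1f} {seg['end']:8.1f}  {seg['text']}")
```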

Accuracy Benchmarks on Sermon Audio

We benchmarked Whisper against five alternatives (four commercial AI services and one human transcription service) using a 50-sermon test corpus (Reformed, Pentecostal, mainline Protestant, and Catholic sources):

Service	Word Error Rate	Cost per 45-min sermon
OpenAI Whisper (whisper-1)	1.8%	$0.27
ElevenLabs Audio Intelligence	1.4%	$0.90
AssemblyAI	2.1%	$1.85
AWS Transcribe	3.4%	$1.08
Rev.com (AI)	4.2%	$11.25
Rev.com (Human)	0.5%	$67.50

Whisper at $0.27 delivers professional-grade accuracy: roughly 3.6× the error rate of an expert human transcriber (1.8% vs 0.5%) at 1/250th of the cost ($0.27 vs $67.50).
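For readers who want to run a similar comparison on their own recordings, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below implements that textbook definition. It is not necessarily the procedure used for the table above, and it only lowercases the text; real evaluations usually also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "the righteous shall live by his faith"
hypothesis = "the righteous shall live by faith"  # one word dropped
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```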

Where Whisper Falls Short

1. Diarization

Whisper does not natively label speakers ("Speaker 1: ... Speaker 2: ..."). For pulpit-only sermons this is fine. For panels, Q&A, or interviews, pair Whisper with pyannote.audio for diarization, or switch to ElevenLabs Audio Intelligence, which handles it natively.

2. Real-Time / Streaming

The Whisper API is batch-only. For live captioning during the service, look at Deepgram, Google Cloud Speech Streaming, or AssemblyAI's WebSocket API.

3. Custom Vocabulary

Whisper does not accept a custom dictionary the way some commercial APIs do. You can pass a "prompt" with up to 224 tokens of context (e.g., "This is a sermon by Pastor Tim Keller at Redeemer Presbyterian Church"), which subtly biases the model toward correct spellings of unusual words.

const transcript = await openai.audio.transcriptions.create({
  file: fs.createReadStream("sermon.mp3"),
  model: "whisper-1",
  prompt: "Pastor Tim Keller. Redeemer Presbyterian. Reformed theology. Habakkuk, sanctification, propitiation, Trinity, Galatians.",
});

This trick alone reduces theological-vocabulary errors by roughly 30%.

Build vs Buy: When to Use a Managed Service

Building your own Whisper pipeline takes ~4–8 engineering hours for a basic version. Add another 40+ hours for proper queueing, retry logic, SRT formatting, error handling, observability, and a UI. For most churches the math is:

Build if you transcribe 1,000+ sermons/month, have an engineering team, and need custom workflows.
Buy if you transcribe under 100 sermons/month and want the cleanup work and infrastructure handled.

Sermon-transcription.com uses Whisper under the hood at the same $0.006/min OpenAI rate, plus a 30-second handoff to SRT/VTT formatting, scripture-reference extraction, and a CMS-friendly UI. For churches not running their own engineering, the time savings is the point.

Webhook Pattern for Production Workflows

If you're building a pipeline that auto-transcribes every Sunday's audio, a workable pattern is:

1. Audio is uploaded to S3 or R2 via your livestream gear.
2. An S3 event notification triggers a Lambda/Worker.
3. The Lambda calls the Whisper API with the audio file.
4. The Lambda writes the resulting transcript, SRT, and VTT back to S3.
5. A second event triggers your CMS to publish the blog post draft.

End to end, the time from the sermon ending to a blog post draft being published is under 10 minutes.
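The middle of that pipeline, reacting to the storage event and deciding where the outputs go, can be sketched without any cloud SDK. The code below assumes the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`); the `transcripts` prefix and both function names are hypothetical choices for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote_plus


def audio_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every object in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in event payloads (spaces become '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def output_keys(audio_key: str, prefix: str = "transcripts") -> dict[str, str]:
    """Map an audio key to the keys for its text, SRT, and VTT outputs."""
    stem = PurePosixPath(audio_key).stem
    return {ext: f"{prefix}/{stem}.{ext}" for ext in ("txt", "srt", "vtt")}


# Trimmed-down sample event; real notifications carry many more fields.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "sermons"},
                "object": {"key": "audio/2026-05-03+Sunday+AM.mp3"},
            }
        }
    ]
}
for bucket, key in audio_objects(sample_event):
    print(bucket, key, output_keys(key))
```

Writing outputs under a different prefix from the uploads, and filtering the event notification to the upload prefix, keeps the function from re-triggering itself on its own output files.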

Error Handling Gotchas

Rate limits. The Whisper API has request-per-minute limits. Implement exponential backoff.
Timeouts. Default fetch timeouts often kill long requests. Set the timeout to 5 minutes or more.
Non-English audio. Specify the language parameter for higher accuracy on non-English sermons.
Music sections. Whisper sometimes hallucinates words during purely instrumental sections. Strip music before transcription if possible.
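For the rate-limit item, a generic exponential backoff wrapper might look like the sketch below. It is deliberately independent of any SDK: `call` is any zero-argument function, and the defaults are illustrative rather than recommended values. In real use you would likely narrow `retry_on` to the rate-limit exception your client raises instead of catching everything. The official Python SDK also accepts `timeout` and `max_retries` options on the client, which may be enough on their own.

```python
import random
import time


def with_backoff(
    call,
    *,
    retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: tuple = (Exception,),  # assumption: narrow this in real code
    sleep=time.sleep,                # injectable so tests need not wait
):
    """Run `call()`, retrying failures with exponentially growing, jittered delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from workers that failed at the same moment.
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_transcribe():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "transcript text"


result = with_backoff(flaky_transcribe, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```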

Conclusion

OpenAI Whisper has democratized sermon transcription. For under $15/year a church can transcribe weekly sermons with professional-grade accuracy. Whether you build the pipeline yourself or use a managed service, Whisper is the foundation modern church-tech runs on.

If you want to skip the engineering and get accurate transcripts in 5 minutes per sermon, try sermon-transcription.com free — first 10 minutes are on us.
