Sermon Transcription Glossary

A

AAC

File Format

Also called: MP4 audio, .m4a (when in MPEG-4 container)

Advanced Audio Coding — a lossy audio compression format used by Apple and many livestream platforms.

AAC (Advanced Audio Coding) is a lossy compressed audio format that delivers better sound quality than MP3 at the same bitrate. Many church livestream encoders (including those built into Macs, iOS devices, and modern OBS setups) produce AAC by default. AAC files transcribe just as accurately as MP3 or WAV in modern AI transcription pipelines.

Related: MP3, WAV, Lossy Compression

Accuracy (Transcription Accuracy)

Quality

The percentage of words correctly transcribed compared to a verified ground-truth transcript.

Transcription accuracy is most often expressed as a percentage (e.g., 99% accurate) or as the complement of Word Error Rate (accuracy ≈ 100% − WER). A 99% accurate transcript of a 5,000-word sermon will contain roughly 50 errors. Accuracy depends heavily on audio quality, accents, background music, theological vocabulary, and the model used. AI services like OpenAI Whisper achieve 95–99% on clean sermon audio.

ADA Compliance

Accessibility

Conformance with the Americans with Disabilities Act, which courts increasingly extend to church websites and online content.

While churches are exempt from many ADA requirements as religious organizations, their public websites may still be subject to lawsuits under Title III if courts consider them places of public accommodation. Closed captions on sermon videos and text transcripts on the website significantly reduce legal exposure and, more importantly, make ministry accessible to the 48 million Americans with hearing loss.

AI Transcription

Method

Automated speech-to-text powered by machine learning models like OpenAI Whisper or ElevenLabs.

AI transcription uses deep neural networks trained on millions of hours of speech to convert audio into text without human typists. Modern AI transcription is 50–100x faster than human transcription and roughly 250x cheaper. For sermons recorded with a clear lapel mic, accuracy now rivals professional human transcribers — typically 97–99.5%.

ASR (Automatic Speech Recognition)

Technology

Also called: Speech-to-Text, STT, Voice Recognition

The technical name for the machine-learning systems that convert speech to text.

ASR (Automatic Speech Recognition) is the engineering term for software that turns spoken audio into written text. Modern ASR systems use transformer-based neural networks (the same family of models behind ChatGPT), and since 2022 the best of them have approached, and on some clean-audio benchmarks matched, human transcription accuracy. OpenAI Whisper, Google Cloud Speech-to-Text, and Amazon Transcribe are leading ASR engines.

B

Bitrate

Audio

The amount of data per second in an audio file, measured in kbps (kilobits per second).

Higher bitrate generally means better audio quality and larger file size. For sermon transcription, anything above 64 kbps mono is sufficient — Whisper actually downsamples audio to 16 kHz internally, so ultra-high bitrates do not improve accuracy. 128 kbps stereo MP3 is the sweet spot for podcast distribution and transcription input alike.
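The link between bitrate and file size is simple arithmetic: bits per second times duration, divided by 8 to get bytes. The sketch below assumes a constant-bitrate file and a 45-minute sermon; the function name is illustrative, and container overhead is ignored.

```python
def encoded_size_mb(bitrate_kbps: float, duration_minutes: float) -> float:
    """Approximate size in megabytes (10^6 bytes) of a constant-bitrate file."""
    total_bits = bitrate_kbps * 1000 * duration_minutes * 60
    return total_bits / 8 / 1_000_000


# Compare the two bitrates mentioned above for a 45-minute recording.
for kbps in (64, 128):
    print(f"{kbps} kbps, 45 min: {encoded_size_mb(kbps, 45):.1f} MB")
```

Under these assumptions, doubling the bitrate doubles the upload size without changing what the transcription model hears after resampling.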

Related: Sample Rate, Compression, MP3

Burned-In Captions

Captions

Also called: Open Captions, Hardcoded Captions

Captions permanently rendered into the video frame, also called 'open captions'.

Burned-in (or open) captions are part of the video file itself and cannot be toggled off. They are common in Reels, TikTok, and Shorts because the algorithms (and silent autoplay) rely on visible text. The downside: viewers can't translate them, and they hurt the watch experience for hearing audiences. For long-form sermons, use closed captions (SRT/VTT) instead.

Related: Closed Captions, SRT, VTT

C

Captions

On-screen text that conveys the spoken dialogue plus relevant sound effects for accessibility.

Captions are designed for viewers who cannot hear the audio — they include speaker labels and non-speech audio (e.g., '[congregation laughing]', '[worship music]'). This differs from subtitles, which assume the viewer can hear and only translate dialogue. Closed captions can be toggled on or off; open captions are burned into the video.

CCLI Licensing

Legal

Christian Copyright Licensing International — the standard licensing framework for worship music in churches.

CCLI licenses cover music reproduction, projection, and streaming. While CCLI doesn't directly govern sermon transcripts, transcribed worship lyrics typically require separate reporting under your CCLI Streaming License. The sermon text itself — being a pastor's own work — is owned by the pastor or the church (depending on employment agreements).

Related: Copyright, Worship Lyrics

Closed Captions

Captions

Also called: CC

Captions that viewers can toggle on or off, typically delivered as a separate SRT or VTT file.

Closed captions are not burned into the video — they are encoded as a separate text track (sidecar file) or embedded in the video container. YouTube, Vimeo, and Facebook all support closed captions via SRT/VTT upload. Always prefer closed captions for long-form sermon video so deaf viewers benefit while hearing viewers aren't forced to read.

Related: SRT, VTT, Open Captions, Captions

Codec

Audio

Software that encodes (compresses) or decodes audio/video — e.g., MP3, AAC, Opus, H.264.

Codecs determine how audio is compressed and stored. Lossy codecs (MP3, AAC, Opus) discard inaudible data for smaller files; lossless formats (FLAC, ALAC, and uncompressed PCM in a WAV file) preserve every bit. For transcription, any codec works as long as the speech is intelligible. For archiving the master recording, use WAV or FLAC.

Related: MP3, WAV, FLAC, AAC

D

Diarization (Speaker Diarization)

Technology

Also called: Speaker Separation, Speaker Labeling

The process of automatically identifying and labeling 'who spoke when' in an audio recording.

Diarization solves the question 'which words belong to which speaker?' It clusters audio segments by voice characteristics and assigns each cluster a label (Speaker 1, Speaker 2, etc.) — without knowing who anyone actually is. Diarization is essential for panel sermons, Q&A sessions, and interviews. ElevenLabs Audio Intelligence performs diarization natively; OpenAI Whisper does not, though it can be paired with pyannote.audio for the same effect.
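When a transcription model and a separate diarization tool are paired, their outputs have to be merged by time. A minimal sketch of one common approach follows: each transcript segment takes the label of the speaker turn it overlaps most. The `Turn` and `Segment` shapes are hypothetical stand-ins, not the output format of Whisper, pyannote.audio, or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization result: an anonymous speaker label and a time span."""
    speaker: str
    start: float
    end: float


@dataclass
class Segment:
    """One transcript segment with its time span in seconds."""
    text: str
    start: float
    end: float


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time spans (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Assign each segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap_seconds(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


turns = [Turn("Speaker 1", 0.0, 12.5), Turn("Speaker 2", 12.5, 20.0)]
segments = [
    Segment("Welcome, everyone.", 0.4, 3.1),
    Segment("Thanks for having me.", 12.0, 14.8),
]
for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```

Segments that overlap no turn at all are labeled "Unknown" rather than guessed, which keeps errors visible for a human editor.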

Discussion Guide

Ministry Output

A small-group resource derived from a sermon, containing scripture references, questions, and application prompts.

A discussion guide is one of the highest-impact downstream products of a sermon transcript. It typically includes 5–7 open-ended questions, the key scripture passages, a 'this week' application challenge, and a closing prayer prompt. AI tools can generate a competent first draft from a transcript in under a minute.

E

ElevenLabs Audio Intelligence

Service

An AI audio platform offering 99.5%+ accuracy transcription with native speaker diarization, entity detection, and word-level timestamps.

ElevenLabs Audio Intelligence is the premium tier underlying many modern church transcription services. It combines exceptional accuracy with built-in diarization (no second tool required) and structured outputs (named entities, sentiment, language detection). Best for multi-speaker recordings, podcast interviews, and panel discussions. Cost: typically $0.02/minute when resold.

Entity Detection (Named Entity Recognition)

Technology

Also called: NER, Named Entity Recognition

Automatic identification of people, places, scripture references, and organizations within a transcript.

Named Entity Recognition (NER) flags structured information in unstructured text. For sermons, this typically means extracting scripture references (John 3:16), book and author citations, geographic locations (Jerusalem, Galilee), and people (Paul, Moses, modern figures referenced). NER is what powers 'jump to the scripture reading' features in modern sermon search.

Related: Diarization, Scripture Reference

F

FLAC

File Format

Free Lossless Audio Codec — a compressed audio format that preserves 100% of the original quality.

FLAC files are typically 50–60% the size of equivalent WAV files but contain identical audio data. FLAC is ideal for archiving master sermon recordings: smaller than WAV, perfect quality, royalty-free. Both Whisper and ElevenLabs accept FLAC directly.

Related: WAV, MP3, Lossless Compression

G

Gain Staging

Audio

Setting microphone and mixer levels so the loudest moments stay below 0 dBFS without clipping.

Proper gain staging is the single biggest factor in transcribable sermon audio. Aim for peaks around -6 dBFS to -3 dBFS, with average levels around -18 dBFS. Clipped audio (signal that hits the 0 dBFS ceiling) introduces distortion that no transcription model can recover from. A lapel mic with correct gain dramatically outperforms a noise-cancelled high-end mic with poor levels.
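Levels like these can be checked on a recording after the fact. The sketch below assumes the audio has already been decoded to floating-point samples in the range -1.0 to 1.0, and it treats "average level" as RMS, which is an assumption rather than something the entry specifies. The function names and the 0.999 clipping threshold are illustrative.

```python
import math


def peak_dbfs(samples):
    """Peak level relative to digital full scale (1.0 maps to 0 dBFS)."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")


def rms_dbfs(samples):
    """RMS (average energy) level relative to digital full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def is_clipped(samples, threshold=0.999):
    """True if any sample sits at or above the assumed clipping threshold."""
    return any(abs(s) >= threshold for s in samples)


# One second of a 1 kHz test tone at half of full scale, sampled at 48 kHz.
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / 48_000) for n in range(48_000)]
print(f"peak: {peak_dbfs(tone):.1f} dBFS, RMS: {rms_dbfs(tone):.1f} dBFS")
print("clipped:", is_clipped(tone))
```

A half-scale tone peaks near -6 dBFS, which is the top of the target range described above.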

H

Hard of Hearing

Accessibility

Also called: HoH, Hearing Impaired (older term, generally avoided)

Mild to severe hearing loss that nonetheless allows some auditory perception, distinct from 'Deaf'.

Approximately 15% of American adults (37.5 million people) are deaf or hard of hearing. Most prefer the term 'hard of hearing' when their loss is mild to severe but not total. Closed captions and text transcripts serve both communities, but the hard-of-hearing community often benefits more from accurate captions paired with audio than from text alone.

Human Transcription

Method

Transcripts produced by trained human typists, typically achieving 99%+ accuracy for $1.00–$1.50 per audio minute.

Human transcription remains the gold standard for nuanced content with heavy theological vocabulary, multiple speakers with similar voices, or poor audio quality. Major providers include Rev.com, GoTranscript, and Scribie. For routine Sunday sermons with clear audio, modern AI transcription matches human accuracy at 1/250th the cost.

Hybrid Transcription

Method

AI-generated first-draft transcripts edited by humans — the best balance of accuracy and cost for high-stakes content.

Hybrid workflows take a 99% accurate AI draft (5 minutes, $0.27) and have a human editor correct the remaining 1% (typically 15–30 minutes of editing, $5–10 in labor). Total cost stays at roughly $10 or less per sermon while accuracy approaches 99.9%. This is the recommended workflow for published sermon archives, books, and accredited Bible study curricula.
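The cost figures in this glossary can be reproduced with a few lines of arithmetic. The sketch below assumes a 45-minute sermon, the $0.006 per minute AI rate quoted under OpenAI Whisper, and an editor rate of $20 per hour, which is inferred from "15–30 minutes, $5–10" rather than stated anywhere. All names are illustrative.

```python
AI_RATE_PER_MINUTE = 0.006   # USD per audio minute, as quoted for the Whisper API
EDITOR_HOURLY_RATE = 20.0    # USD per hour, inferred from the $5-10 labor range


def ai_cost(audio_minutes):
    """Cost of the AI first draft."""
    return audio_minutes * AI_RATE_PER_MINUTE


def hybrid_cost(audio_minutes, editing_minutes):
    """AI draft plus human editing time."""
    return ai_cost(audio_minutes) + editing_minutes / 60 * EDITOR_HOURLY_RATE


def human_cost(audio_minutes, rate_per_minute):
    """Fully human transcription at a per-audio-minute rate."""
    return audio_minutes * rate_per_minute


sermon_minutes = 45
print(f"AI only:        ${ai_cost(sermon_minutes):.2f}")
print(f"Hybrid (15 min): ${hybrid_cost(sermon_minutes, 15):.2f}")
print(f"Hybrid (30 min): ${hybrid_cost(sermon_minutes, 30):.2f}")
print(f"Human at $1.50:  ${human_cost(sermon_minutes, 1.50):.2f}")
```

Almost all of the hybrid cost is editor time, so the editing estimate matters far more than the choice of AI provider.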

K

Keywords (SEO Keywords)

SEO

The search queries you want your sermon transcript pages to rank for in Google.

Keywords for sermon content are typically topical ('what does the Bible say about forgiveness'), scriptural ('John 3:16 explained'), or seasonal ('Easter sermon outline'). Long-tail keywords (4+ words) have lower search volume but much higher intent. A single sermon transcript can rank for hundreds of long-tail keywords once published.

L

Lapel Microphone (Lavalier)

Equipment

Also called: Lav Mic, Lavalier, Clip-on Mic

A small clip-on microphone worn near the speaker's collar, ideal for sermon recording.

Lapel mics deliver the cleanest sermon audio because they stay 6–12 inches from the speaker's mouth regardless of pulpit movement. Wireless lavaliers from Rode, Shure, or Sennheiser ($150–$600) outperform $5,000 ceiling-mounted mics for transcription purposes. Position the capsule 6 inches below the chin, slightly off-center to avoid breath sounds.

Related: Audio Quality, Microphone

Llms.txt

AI/LLM

Also called: LLMs.txt

A proposed standard markdown file at /llms.txt that summarizes a website for large language models.

Llms.txt is to LLMs what robots.txt is to crawlers and sitemap.xml is to search engines: a single curated entry point. Located at /llms.txt, it provides an LLM-friendly summary of your site, key pages, products, and policies. Sermon-transcription.com publishes both /llms.txt (summary) and /llms-full.txt (full content). The file is intended to help your content appear correctly in ChatGPT, Claude, Perplexity, and Google AI Overviews, although as a proposed standard its adoption by these tools varies.
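As a rough illustration of the file's shape, the sketch below renders a small llms.txt following the layout described in the llms.txt proposal: a title heading, a one-line blockquote summary, and sections of annotated links. The church name, URLs, and the `render_llms_txt` helper are all hypothetical.

```python
def render_llms_txt(site_name, summary, sections):
    """Build llms.txt content: H1 title, blockquote summary, H2 link sections."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, for illustration only.
content = render_llms_txt(
    "Example Church",
    "Weekly sermons with full transcripts and discussion guides.",
    {
        "Sermons": [
            ("Sermon archive", "https://example.org/sermons", "Transcripts by date"),
            ("Topics", "https://example.org/topics", "Sermons grouped by theme"),
        ],
    },
)
print(content)
```

The output is plain markdown, so it can be written to the web root as a static file.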

Lossy Compression

Audio

Audio compression that permanently discards data deemed inaudible to reduce file size.

MP3, AAC, and Opus are all lossy. Once audio is encoded lossy, the discarded information cannot be recovered — re-encoding a lossy file repeatedly degrades quality. For transcription, this rarely matters because Whisper downsamples everything to 16 kHz. For archival, always keep one lossless master (WAV or FLAC).

Related: MP3, AAC, FLAC, WAV

M

MP3

File Format

The most common lossy audio format on the web — universally supported, modest file sizes.

MP3 (MPEG-1 Audio Layer III) became the de facto digital audio standard in the late 1990s. For sermon transcription, 64 kbps mono MP3 is more than enough — anything higher just wastes storage with no accuracy gain. MP3 patents fully expired in 2017, making it royalty-free.

Related: AAC, WAV, Bitrate

MP4

File Format

A video container format that can hold audio, video, subtitles, and metadata.

MP4 is a container, not a codec. Most MP4 video files contain H.264 video and AAC audio. Modern transcription services accept MP4 directly — they automatically extract the audio track. There's no need to convert sermon video to MP3 before transcribing.

Related: MOV, MP3, AAC, H.264

N

Neural Network

Technology

The mathematical structure underlying modern AI transcription, loosely modeled on biological neurons.

ASR models like Whisper use transformer neural networks with hundreds of millions of parameters, trained on hundreds of thousands of hours of audio paired with transcripts (about 680,000 hours for the original Whisper release). Unlike older rule-based speech recognition (which struggled with accents and noise), neural networks learn to handle real-world conditions by example.

Related: ASR, Whisper, Machine Learning

O

Open Captions

Captions

Also called: Burned-In Captions, Hardcoded Captions

Captions permanently rendered into the video — the same as burned-in captions.

Open captions appear by default and cannot be turned off. They're best for short social clips where vertical autoplay-muted viewing dominates (TikTok, Reels). For long-form YouTube sermons, prefer closed captions so viewers can toggle them.

OpenAI Whisper

Service

An open-source speech-recognition model from OpenAI achieving near-human accuracy on 90+ languages.

Released in 2022, Whisper revolutionized transcription cost economics: 99% accuracy at $0.006/minute via OpenAI's API, or free if you run it locally. Whisper handles accents, technical vocabulary, and moderate background noise gracefully. Its main limitations are speaker diarization (not built in) and live streaming (designed for batch processing).

Related: ElevenLabs, ASR, AI Transcription

P

Podcast

Distribution

An on-demand audio program distributed via RSS feed and listened to on apps like Apple Podcasts and Spotify.

Most churches' sermon archives are technically podcasts already. Submitting the RSS feed to Apple Podcasts, Spotify, and Amazon Music (Google Podcasts was discontinued in 2024) can multiply reach by 5–10x with no extra production work. Transcripts published alongside each episode (or in podcast show notes) boost SEO and accessibility.

R

Real-Time Transcription

Method

Also called: Live Captioning, Streaming Transcription

Speech-to-text generated live during the sermon, useful for on-screen captions.

Real-time (streaming) transcription differs from batch transcription in latency: it must produce text within ~1–2 seconds of being spoken. Real-time accuracy is typically 2–5% lower than post-event batch transcription because the model can't see the full context. Use it for live captions; use batch for archive transcripts.

Repurposing

Content Strategy

Also called: Content Multiplication

The practice of converting one piece of content (a sermon) into many derivative pieces (blog, social, email).

A 45-minute sermon transcript can be repurposed into: 1 SEO blog post, 5–10 social media graphics, 1 weekly email devotional, a 5-day devotional series, a small group discussion guide, multiple 60-second vertical clips, and a printable handout. Repurposing is the single highest-ROI activity for under-resourced church communications teams.

S

Sample Rate

Audio

How many times per second an audio waveform is measured, expressed in kHz.

Standard sample rates are 44.1 kHz (CD quality), 48 kHz (broadcast/video standard), and 16 kHz (speech-optimized). Whisper internally downsamples to 16 kHz regardless of input — so a 48 kHz studio recording transcribes identically to a 16 kHz phone recording in terms of model performance, provided the audio is otherwise clear.

Related: Bitrate, Audio Quality

Schema Markup (JSON-LD)

SEO

Structured data embedded in a webpage that tells search engines and LLMs exactly what the page is about.

Schema.org markup in JSON-LD format powers Google's rich results (breadcrumbs, video carousels, and, where still supported, FAQ snippets; Google has scaled back FAQ and How-To rich results since 2023). For sermon pages, the most valuable schemas are Article, FAQPage, HowTo, BreadcrumbList, VideoObject, and DefinedTerm. Schema helps search engines and AI systems interpret a page unambiguously, which makes it a high-leverage technical SEO investment, though it does not guarantee inclusion in AI Overviews.
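For a glossary page, DefinedTerm is the natural fit. A minimal sketch of generating that markup follows; the property names come from schema.org, while the helper function and the example.org URL are hypothetical.

```python
import json


def defined_term_jsonld(name, description, term_set_url):
    """Build a schema.org DefinedTerm object ready for JSON-LD serialization."""
    return {
        "@context": "https://schema.org",
        "@type": "DefinedTerm",
        "name": name,
        "description": description,
        "inDefinedTermSet": term_set_url,
    }


term = defined_term_jsonld(
    "Word Error Rate (WER)",
    "The standard metric for transcription accuracy.",
    "https://example.org/glossary",  # placeholder glossary URL
)
markup = json.dumps(term, indent=2)
print(markup)
```

The resulting JSON is typically embedded in the page inside a script element with type "application/ld+json".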

Related: SEO, AI Overviews, Rich Results

Scripture Reference

Ministry Output

A citation of a Bible passage in 'Book Chapter:Verse' form (e.g., John 3:16, Romans 8:28–30).

Scripture references are the highest-value structured data in any sermon. Modern AI can extract them automatically, linking each citation to a Bible API for inline display. A transcript with linked scripture references serves Bible-app users, drives SEO via long-tail Bible queries, and dramatically improves time-on-page.
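Extraction does not always need a large model; a pattern match catches the common 'Book Chapter:Verse' form. The sketch below is a simplified illustration with an abbreviated book list, not a complete extractor: it ignores abbreviations such as "Rom." and references that span chapters.

```python
import re

# Abbreviated list for illustration; a real extractor would include every book.
BOOKS = ["Genesis", "Psalms", "Isaiah", "Matthew", "John", "Romans",
         "1 Corinthians", "1 John"]

# Longest names first so "1 John" is tried before "John".
_book_pattern = "|".join(
    re.escape(b) for b in sorted(BOOKS, key=len, reverse=True)
)
REFERENCE_RE = re.compile(
    rf"\b({_book_pattern}) (\d{{1,3}}):(\d{{1,3}})(?:[-–](\d{{1,3}}))?"
)


def extract_references(text):
    """Return each scripture reference found in text as a small dict."""
    refs = []
    for book, chapter, verse, end_verse in REFERENCE_RE.findall(text):
        refs.append({
            "book": book,
            "chapter": int(chapter),
            "verse": int(verse),
            "end_verse": int(end_verse) if end_verse else None,
        })
    return refs


sample = "Paul writes in Romans 8:28–30, and John 3:16 echoes 1 John 4:8."
for ref in extract_references(sample):
    print(ref)
```

Matching against a known book list avoids false positives such as a capitalized word followed by a clock time.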

Related: Entity Detection, Bible API

Sermon Archive

Distribution

A searchable, organized collection of a church's sermons spanning months or years.

Transcribed sermon archives convert a church website from a brochure into a research library. Members find past messages; first-time visitors search by topic, scripture, or pastor; Google indexes everything as long-tail organic traffic. Major archives like Grace To You (John MacArthur), Desiring God (John Piper), and Spurgeon Gems demonstrate the SEO power of decades of indexed text.

Related: Sermon Library, SEO

Sermon Clips

Content Strategy

Also called: Sermon Shorts, Sermon Reels

30–90 second vertical-video excerpts pulled from a full sermon for Reels, TikTok, and Shorts.

Sermon clips are the social-media equivalent of a movie trailer. The best clips share one self-contained idea, open with a hook in the first 3 seconds, and end with a question or call to read more. Transcripts make finding clip-worthy moments dramatically faster — Cmd-F through 5,000 words beats scrubbing 45 minutes of video.

Related: Repurposing, Burned-In Captions

Sermon Transcription

Core

The process of converting recorded sermon audio (or video) into written, formatted text.

Sermon transcription is the conversion of preached audio into searchable, readable, indexable text. It serves accessibility (deaf and hard-of-hearing members), discoverability (Google can't index audio), content repurposing (blogs, social, email), and member engagement (note-taking, study, sharing). In 2026, AI transcription has made the process near-instant ($0.27 per sermon, 5 minutes turnaround).

Speaker ID

Technology

Identifying which named individual (not just 'Speaker 1') is speaking in a recording.

Speaker ID extends diarization by mapping anonymous speaker clusters to known voices via voice fingerprinting. Set it up by providing a few labeled samples ('this is Pastor John,' 'this is Worship Leader Sarah'); the model then labels every future recording automatically. ElevenLabs and AssemblyAI support speaker ID; OpenAI Whisper does not.

SRT (SubRip Subtitle File)

File Format

The most common subtitle file format, a plain text file with timecodes and dialogue.

SRT files are tiny plain-text files containing numbered subtitle blocks, each with a start/end timecode and the dialogue text. YouTube, Vimeo, Facebook, and most video players accept SRT directly. Every sermon transcription service worth using exports SRT alongside plain text.
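The block structure is easy to see when generated from data. A minimal sketch, assuming segments arrive as (start seconds, end seconds, text) tuples; the function names are illustrative.

```python
def srt_timecode(seconds):
    """Format seconds as an SRT timecode: HH:MM:SS,mmm (comma before millis)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timecode(start)} --> {srt_timecode(end)}\n{text}\n"
        )
    # Blocks are separated by a blank line.
    return "\n".join(blocks)


segments = [
    (0.0, 2.5, "Good morning, church."),
    (2.5, 6.0, "Please open to John 3:16."),
]
print(to_srt(segments))
```

Each block is a sequence number, a timecode line with an arrow between start and end, and the caption text.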

Related: VTT, Captions, Closed Captions

Subtitles

Captions

On-screen text translating dialogue, intended for viewers who can hear but don't speak the language.

Subtitles assume hearing — they don't include speaker labels or sound effects like '[congregation singing]'. They're typically used for translation (Spanish subtitles on an English sermon, for instance). Captions (the broader term) are for accessibility; subtitles are for translation. Modern AI services produce both from one source transcript.

T

Timestamp

Technology

A marker in a transcript indicating when each word or phrase was spoken in the source audio.

Timestamps come in two flavors: segment-level (every 5–30 seconds) and word-level (every individual word). Word-level timestamps power karaoke-style captions, exact-quote linking ('click to play this sentence'), and rapid clip generation. ElevenLabs and Whisper API both return word-level timestamps; check that your provider does the same.

V

VTT (Web Video Text Tracks)

File Format

The HTML5-native subtitle format used by browser <video> tags, similar to SRT but slightly richer.

VTT (WebVTT) is the official caption format of the HTML5 video specification. It supports styling (italics, positioning, colors), unlike SRT. Most modern church websites that self-host sermon video should serve VTT captions; for YouTube uploads, SRT is the safer choice.
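For basic captions the two formats differ in only two ways: a WebVTT file starts with a "WEBVTT" header line, and its timecodes use a period instead of a comma before the milliseconds. A minimal conversion sketch follows; it assumes well-formed SRT input and does not handle styling tags, and the function name is illustrative.

```python
import re

# Matches an SRT timecode such as 00:01:02,500 and captures both halves.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text):
    """Convert basic SRT text to WebVTT by fixing the header and timecodes."""
    body = _SRT_TIME.sub(r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body.strip() + "\n"


srt = "1\n00:00:00,000 --> 00:00:02,500\nGood morning, church.\n"
print(srt_to_vtt(srt))
```

The numeric SRT sequence numbers are left in place, since WebVTT accepts them as optional cue identifiers.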

Related: SRT, Captions, HTML5 Video

W

WAV

File Format

An uncompressed audio format storing raw PCM data — the gold standard for masters.

WAV files preserve 100% of the recorded audio with no compression. A 45-minute sermon in 16-bit/44.1 kHz stereo WAV weighs about 476 MB (roughly 454 MiB). Use WAV for the original master file you archive; transcribe from a smaller MP3 or AAC copy to save upload time without losing accuracy.
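Uncompressed file size follows directly from the recording parameters: sample rate times bytes per sample times channels times duration. The sketch below works through the 45-minute example, reporting both decimal megabytes and binary mebibytes since the two units give noticeably different numbers. The function name is illustrative and the small WAV header is ignored.

```python
def pcm_size_bytes(sample_rate_hz, bit_depth, channels, duration_seconds):
    """Size of raw PCM audio in bytes (WAV header not included)."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * duration_seconds


size = pcm_size_bytes(44_100, 16, 2, 45 * 60)
print(f"{size:,} bytes")
print(f"{size / 1_000_000:.0f} MB (decimal)")
print(f"{size / 1_048_576:.0f} MiB (binary)")
```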

Related: FLAC, MP3, Lossless

Whisper

Service

Shorthand for OpenAI Whisper — the leading open-source ASR model.

See OpenAI Whisper.

Related: OpenAI Whisper, ASR

Word Error Rate (WER)

Quality

Also called: WER

The standard metric for transcription accuracy — the percentage of words that are wrong.

Word Error Rate counts insertions, deletions, and substitutions as a percentage of the reference transcript's total word count. A WER of 5% means about 5 errors for every 100 reference words, which corresponds to roughly 95% accuracy. Because insertions count as errors, WER can exceed 100% on very poor output. Modern AI sermon transcription typically achieves WER between 1% and 5% on clean audio.
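WER is computed as a word-level edit distance divided by the length of the reference. A minimal sketch follows, assuming simple lowercasing and whitespace tokenization; production evaluations usually normalize punctuation and numerals first, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "for god so loved the world"
hypothesis = "for god so love the world"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

One wrong word out of six gives a WER near 17%, which shows why short test clips produce noisy scores.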

Related: Accuracy, ASR

Word-Level Timestamps

Technology

A timestamp attached to every individual word in a transcript, not just every sentence or segment.

Word-level timestamps enable interactive transcripts (click any word to jump to that moment), karaoke captions, automatic clip extraction, and exact-quote linking. They cost slightly more to produce and store than segment-level timestamps but unlock dramatically richer downstream applications.
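Exact-quote linking and clip extraction both reduce to the same lookup: find a phrase in the word list and return the start of its first word and the end of its last. A minimal sketch, assuming each word is a dict with "word", "start", and "end" keys; that shape is hypothetical and providers differ in their field names.

```python
PUNCTUATION = ".,;:!?\"'"


def _normalize(token):
    """Lowercase a token and strip surrounding punctuation for matching."""
    return token.lower().strip(PUNCTUATION)


def find_phrase(words, phrase):
    """Return (start, end) seconds of the first occurrence of phrase, or None."""
    target = [_normalize(t) for t in phrase.split()]
    tokens = [_normalize(w["word"]) for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None


words = [
    {"word": "For", "start": 12.0, "end": 12.2},
    {"word": "God", "start": 12.2, "end": 12.5},
    {"word": "so", "start": 12.5, "end": 12.7},
    {"word": "loved", "start": 12.7, "end": 13.1},
    {"word": "the", "start": 13.1, "end": 13.2},
    {"word": "world,", "start": 13.2, "end": 13.8},
]
print(find_phrase(words, "so loved the world"))
```

The returned pair can be passed to a video editor or player as the in and out points for a clip.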

Related: Timestamp, Diarization
