How the cleaner works
- Tokenize and scan. We walk the text with word-boundary regular expressions that match each filler family without touching words that contain those letters (so "lumber" is safe from "um", and "huh" is safe from "uh").
- Contextual "like" detection. The word "like" is only a filler in specific positions — between commas, after a sentence-starting comma, or in the comma-pause pattern "It was, like, really good." We only strip those, leaving meaningful uses like "love like Christ" intact.
- Repeat collapsing and cleanup. Consecutive duplicate words ("the the the") are reduced to one, double spaces are normalized, stray punctuation is fixed, and the first letter of every sentence is re-capitalized.
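The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual code, which runs in the browser. The filler list, the pattern names, and the function `clean_transcript` are all hypothetical. Treating "Like," at the start of a sentence as a filler is one reading of the positional rules, and dropping both commas around a mid-sentence "like" is likewise a guess.

```python
import re

# Hypothetical filler families; the real tool's list is not given in the text.
# \b keeps "lumber" and "huh" safe because the match must start and end
# on a word boundary. An optional trailing comma is removed with the filler.
FILLER_RE = re.compile(r"\b(?:um+|uh+|erm+)\b,?\s*", re.IGNORECASE)

# "like" is treated as filler only when set off by commas on both sides...
LIKE_BETWEEN_COMMAS_RE = re.compile(r"\s*,\s*like\s*,\s*", re.IGNORECASE)
# ...or when it opens a sentence and is followed by a comma (an assumption).
LIKE_SENTENCE_START_RE = re.compile(r"(^|[.!?]\s+)like\s*,\s*", re.IGNORECASE)

# Consecutive duplicate words: "the the the" -> "the".
REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)

# First lowercase letter of the text or after sentence-ending punctuation.
SENTENCE_START_RE = re.compile(r"(^|[.!?]\s+)([a-z])")


def clean_transcript(text: str) -> str:
    """Strip fillers, contextual "like", and repeats, then tidy up."""
    text = FILLER_RE.sub("", text)
    text = LIKE_BETWEEN_COMMAS_RE.sub(" ", text)
    text = LIKE_SENTENCE_START_RE.sub(r"\1", text)
    text = REPEAT_RE.sub(r"\1", text)
    # Cleanup: no space before punctuation, no comma left dangling
    # before a full stop, no runs of spaces.
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    text = re.sub(r",(?=[.!?])", "", text)
    text = re.sub(r"[ \t]{2,}", " ", text).strip()
    # Re-capitalize the first letter of every sentence.
    return SENTENCE_START_RE.sub(
        lambda m: m.group(1) + m.group(2).upper(), text
    )


if __name__ == "__main__":
    sample = ("Um, so the the the lumber was, like, really good. "
              "Uh, we love like Christ.")
    print(clean_transcript(sample))
    # prints: So the lumber was really good. We love like Christ.
```

Because every step is a fixed regular-expression substitution, the same input always produces the same output. A blanket repeat rule like this one will also collapse legitimate doubles such as "had had", so a production version would likely need an exception list.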
Why a clean transcript matters
Raw machine transcripts are notoriously messy. Even strong models like Whisper preserve every "um", "uh", and false start because that's what was actually said. For a literal court reporter, that fidelity is the point. For a pastor turning Sunday's sermon into a blog post, weekly email, or printable Bible study guide, those filler words become friction. Studies of digital reading behavior consistently show that filler-heavy prose causes readers to bounce: average dwell time drops 30 to 40 percent when content reads like a verbal recording rather than written prose.
This cleaner uses a conservative, rule-based approach rather than a language model, which means three things. First, the output is deterministic: paste the same text twice and get identical results. Second, it's fast: even a 10,000-word manuscript processes in under a tenth of a second. Third, it's private: nothing leaves your browser, nothing is logged, nothing is uploaded to a third party. The cleaner is especially useful as a pre-processing step before sending a transcript to ChatGPT, Claude, or a human editor: stripping the noise first reduces token usage and lets the next stage focus on substance.
Most preachers find that they can cut between 8% and 15% of total word count by removing fillers and repeats. A 40-minute sermon at 135 WPM yields roughly 5,400 spoken words, of which about 430 to 810 are typically filler. That's a full printed page or more of clutter that, once removed, reveals tighter, more publishable prose underneath. Pair this with the Readability Analyzer to see the grade-level improvement after cleaning.
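The estimate above is plain arithmetic, and a short Python sketch makes it easy to rerun for a different sermon length or speaking pace. The function name `estimate_filler_words` is hypothetical, and the 8% and 15% defaults are simply the range quoted in the text, not measured values. The figures in the prose are rounded.

```python
def estimate_filler_words(minutes: float, wpm: float,
                          low: float = 0.08, high: float = 0.15):
    """Return (total words, low filler estimate, high filler estimate)."""
    total = round(minutes * wpm)
    return total, round(total * low), round(total * high)


if __name__ == "__main__":
    total, low, high = estimate_filler_words(40, 135)
    print(total, low, high)
    # prints: 5400 432 810
```

A 25-minute homily at the same pace, for comparison, would come out at 3,375 words before cleaning.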
Related tools
- Sermon Readability Analyzer — measure grade level before and after cleaning.
- Sermon Word Counter — count words, characters, and reading time.
- SRT to Text — strip timecodes before cleaning.
- Sermon Tag Cloud — visualize the dominant themes in a cleaned transcript.