Benchmarking speech-to-text on long-form audio
Comparing 8 STT models on a 27-minute podcast. Local Whisper wins on word accuracy, but cloud APIs dominate punctuation.
I wanted to answer a simple question: how do different speech-to-text services actually perform on real-world long-form audio? Not short clips, not carefully recorded studio audio, but a real 27-minute podcast episode with natural speech patterns, filler words, and the kind of conversational flow that tends to trip up transcription systems. So I set up a systematic evaluation: 8 model configurations (3 local, 5 cloud), one professional human transcription as the ground truth reference, and metrics covering both word accuracy and punctuation fidelity.
The test setup
The source audio was a 27-minute English podcast, 4,748 words in the professional reference transcript with 688 punctuation marks. On the local side, I tested Whisper-Base with explicit English language setting, Whisper-Base with auto-detection, and Whisper-Tiny. On the cloud side, I evaluated Deepgram Nova-3, AssemblyAI Best, OpenAI's Whisper-1 API, Gladia Solaria-1, and Speechmatics SLAM-1. Word error rate (WER) and character error rate (CER) were calculated using the jiwer library, and I developed a custom punctuation accuracy metric that measures both count and context-aware placement.
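For reference, the core of the word-accuracy calculation is only a few lines. This is a simplified sketch of what the repo scripts do; the file paths and the normalization step here are illustrative, not the exact pipeline:

```python
import string
import jiwer

reference = open("reference_transcript.txt").read()
hypothesis = open("model_output.txt").read()

def normalize(text):
    # score word identity only: lowercase, strip punctuation, collapse whitespace
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
cer = jiwer.cer(reference, hypothesis)
print(f"WER {wer:.2%} | word accuracy {1 - wer:.2%} | CER {cer:.2%}")
```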
The word accuracy surprise
Here's what I didn't expect: local Whisper-Base actually achieved the highest word accuracy of any model tested, hitting 82.48% (17.52% WER). The best cloud performer, Deepgram Nova-3, came in at 81.28% (18.72% WER). That's a gap of only 1.2 percentage points. AssemblyAI Best was right behind at 81.21%, and even OpenAI's own Whisper-1 cloud API scored 80.73%. The spread from best to worst across all 8 configurations was only about 5 percentage points.
This was genuinely surprising. I expected cloud services --- with their proprietary post-processing pipelines, language models, and extensive training data --- to significantly outperform a local Whisper-Base model running on my desktop. For pure word-level transcription, they don't. The gap is trivial.
Punctuation is where the real story lives
But word accuracy only tells half the story. A transcript with all the right words but no punctuation is borderline unusable for publication. And this is where local and cloud models diverge dramatically.
Local Whisper-Base captured only 21.90% of the reference punctuation. It got 16% of periods right, 30% of commas, and 0% of exclamation marks, quotation marks, and colons. The output reads like a wall of text with occasional pauses. Whisper-Tiny was even worse at 18.78%.
Cloud services told a completely different story. Deepgram Nova-3 hit 51.17% punctuation accuracy with a near-perfect count: 698 punctuation marks versus 688 in the reference. AssemblyAI Best reached 48.43%. Even the lowest cloud performer, Speechmatics SLAM-1, achieved 38.23% --- still nearly double the best local result.
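The count side of the punctuation metric is straightforward; the context-aware placement check in the repo scripts is more involved. Here's a stripped-down sketch of the counting half, reusing the reference and hypothesis strings from the earlier snippet:

```python
from collections import Counter

PUNCT = set(".,!?;:\"'")

def punct_counts(text):
    return Counter(ch for ch in text if ch in PUNCT)

ref_counts = punct_counts(reference)
hyp_counts = punct_counts(hypothesis)

print(f"marks: reference={sum(ref_counts.values())} model={sum(hyp_counts.values())}")
for mark, ref_n in ref_counts.most_common():
    hyp_n = hyp_counts.get(mark, 0)
    # count-based coverage per mark type; placement is scored separately
    print(f"{mark!r}: ref={ref_n} model={hyp_n} coverage={min(hyp_n, ref_n) / ref_n:.0%}")
```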
The over-punctuation problem
An interesting pattern emerged in the cloud results: several services significantly over-punctuated. Speechmatics inserted 1,003 punctuation marks versus 688 in the reference --- 46% more than needed. OpenAI's Whisper-1 added 911 marks (32% more), and AssemblyAI added 791 (15% more). Deepgram Nova-3 was the most restrained, producing 698 marks versus the 688 reference, essentially matching the professional transcription's punctuation density.
Over-punctuation is arguably harder to fix than under-punctuation. Removing excess commas and periods requires judgment about sentence structure and flow, while adding punctuation to unpunctuated text can be done more mechanically. This makes Deepgram's restraint particularly valuable for production workflows.
Speed: a decisive cloud advantage
Deepgram Nova-3 processed the entire 27-minute podcast in 3 seconds. That's not a typo. Three seconds for 27 minutes of audio. Local inference, even with GPU acceleration, takes meaningfully longer. For batch processing workflows --- transcribing meeting recordings, podcast archives, interview collections --- this speed difference is the kind of thing that changes whether a workflow is practical or not.
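If you want to reproduce the timing yourself, a single REST call is enough. This sketch assumes Deepgram's documented /v1/listen prerecorded endpoint and the nova-3 model name; check the current API docs before relying on the exact parameters:

```python
import time
import requests

DEEPGRAM_API_KEY = "..."  # set your own key

with open("podcast.mp3", "rb") as f:
    audio = f.read()

start = time.perf_counter()
resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params={"model": "nova-3", "punctuate": "true", "smart_format": "true"},
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}",
             "Content-Type": "audio/mpeg"},
    data=audio,
)
elapsed = time.perf_counter() - start

transcript = resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
print(f"{len(transcript.split())} words transcribed in {elapsed:.1f}s")
```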
Error distribution analysis
Looking at the error distribution reveals interesting architectural differences. Local Whisper-Base had 83.4% hits, 15.3% substitutions, 1.3% deletions, and 0.9% insertions. It rarely drops or adds words, but when it gets a word wrong, it substitutes something phonetically similar. Deepgram Nova-3 had 82.5% hits but a different error profile: 13.0% substitutions (fewer than local), but 4.5% deletions (significantly more). It's more likely to skip over words entirely but less likely to hallucinate incorrect alternatives. Both approaches have tradeoffs depending on your use case.
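These breakdowns come straight from jiwer's alignment output. With a recent jiwer release, process_words exposes the per-category counts (again reusing the normalized transcripts from earlier):

```python
out = jiwer.process_words(normalize(reference), normalize(hypothesis))

# hits + substitutions + deletions account for every reference word;
# insertions are extra words the model added on top
ref_words = out.hits + out.substitutions + out.deletions
for name, n in [("hits", out.hits), ("substitutions", out.substitutions),
                ("deletions", out.deletions), ("insertions", out.insertions)]:
    print(f"{name}: {n} ({n / ref_words:.1%})")
```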
Practical recommendations
After running all these benchmarks, here's how I'd choose for different use cases. For research and analysis --- search indexing, content analysis, data extraction --- local Whisper-Base is hard to beat. Highest word accuracy, zero marginal cost, complete privacy. The awful punctuation doesn't matter if you're processing the text programmatically.
For publication-ready transcription --- blog posts, articles, documentation --- Deepgram Nova-3 is the clear winner. Best punctuation accuracy, near-best word accuracy, near-perfect punctuation count, and blazing speed. The transcripts it produces need minimal post-editing compared to anything else I tested.
For a hybrid approach, you could run local Whisper-Base for the raw word accuracy and then use a cloud service or LLM to add punctuation and formatting. This optimizes for word accuracy while reducing per-minute transcription costs.
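A minimal sketch of that pipeline, assuming the openai-whisper package locally and any capable chat model for the punctuation pass (the model name and prompt here are illustrative, not something I benchmarked):

```python
import whisper
from openai import OpenAI

# local pass: best raw word accuracy, no per-minute cost
local_model = whisper.load_model("base")
raw_text = local_model.transcribe("podcast.mp3", language="en")["text"]

# cloud pass: restore punctuation and casing without touching the words
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any capable chat model works
    messages=[
        {"role": "system",
         "content": "Add punctuation and sentence casing to this transcript. "
                    "Do not add, remove, or change any words."},
        {"role": "user", "content": raw_text},
    ],
)
punctuated = resp.choices[0].message.content
```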
Limitations and future work
I want to be transparent about the limitations. This was a single audio file evaluation (n=1), English only, in a podcast/conversational format. I didn't evaluate speaker diarization or timestamp accuracy. The punctuation metric is heuristic-based rather than ground-truth validated. Results could look very different with different audio qualities, accents, or recording environments. I'd like to expand this to multiple audio samples across different formats, and add evaluations for diarization accuracy and timestamp precision.
The full evaluation methodology, Python scripts, raw results data, and SVG visualizations are all available in the repo. If you want to run similar benchmarks on your own audio, the scripts are ready to go --- just drop in your reference transcript and inference outputs. Check it out: Long-Form-Audio-Eval on GitHub.
Single-shot STT benchmark for long-form audio.