Does fine-tuning Whisper actually improve accuracy?
Evaluating whether fine-tuning Whisper improves transcription accuracy. Spoiler: it depends on model size and use case.
After building my Whisper fine-tuning pipeline on Modal, the natural next question was: does it actually help? I assumed the answer would be an obvious yes --- you train a model on your specific audio, it gets better at recognizing your speech patterns. Simple, right? Turns out the reality is considerably more nuanced, and the results from my systematic evaluation genuinely surprised me.
The setup
I fine-tuned all five Whisper model variants (tiny, base, small, medium, and large-v3-turbo) on approximately 90 minutes of my own speech audio, then evaluated each fine-tuned model against its original counterpart using a test set of 91 audio samples. The test set included both technical vocabulary (the kind of AI/tech terminology I use daily) and code-switched content (English mixed with Hebrew, which reflects how I actually speak). I measured word error rate (WER) using the jiwer library with standard Levenshtein distance, and ran all inference through whisper.cpp with Vulkan GPU acceleration on my AMD Radeon RX 7700 XT.
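For readers unfamiliar with the metric: WER is the word-level Levenshtein edit distance divided by the reference length, which is what jiwer computes under the hood. A minimal pure-Python sketch of that calculation (the transcripts here are invented for illustration):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance between the
    reference transcript and the model output, divided by the number
    of reference words. Mirrors what jiwer computes per sample."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution across four reference words -> WER of 0.25.
print(wer("the whisper model improved", "the whisper model improve"))  # 0.25
```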
The headline result: bigger models got worse
The smaller models all improved after fine-tuning. Tiny went from 7.85% WER to 6.59% (a 16% improvement). Base improved from 5.46% to 5.02% (8% better). Small dropped from 3.96% to 3.45% (13% better). So far, so good.
But medium and large both got worse. Medium went from 2.71% to 2.96% WER, and large from 2.77% to 3.01% --- both about a 9% degradation. That's not a rounding error. Fine-tuning actively harmed the larger models' performance on general transcription.
Why did larger models degrade?
I think several factors are at play. First, there's a capacity-versus-data mismatch. Larger models have hundreds of millions more parameters, and with only 90 minutes of training data, the fine-tuning probably introduced more noise than signal. The models essentially learned some idiosyncratic patterns from my single-speaker dataset that didn't generalize well.
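To make the capacity-versus-data mismatch concrete, here's a back-of-the-envelope calculation. The parameter counts are approximate figures for the Whisper variants (my rounding, not from the evaluation repo):

```python
# Approximate parameter counts per Whisper variant, in millions.
PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large-v3-turbo": 809,
}

TRAINING_MINUTES = 90  # the fine-tuning dataset used here

# Seconds of training audio per million parameters: the larger the
# model, the thinner the 90 minutes of signal is spread.
for name, millions in PARAMS_M.items():
    seconds_per_m = TRAINING_MINUTES * 60 / millions
    print(f"{name:>14}: {seconds_per_m:6.1f} s of audio per million params")
```

Tiny sees well over two minutes of audio per million parameters; medium and large-v3-turbo see roughly seven seconds, which is consistent with the noise-versus-signal argument above.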
Second, the larger models were already well-optimized for the kind of content I was testing. Their massive pre-training already included technical vocabulary and clear English speech. Fine-tuning on a narrow dataset may have actually narrowed their generalization capabilities --- the classic problem of catastrophic forgetting, where updating a neural network on new data causes it to lose previously learned information.
The practical implication is clear: don't assume fine-tuning will help. For larger models, you likely need significantly more training data --- hours, not minutes --- to see benefits without degradation.
The code-switching exception: dramatic improvement everywhere
Here's where the story gets really interesting. Code-switching --- mixing two languages mid-sentence, like when I drop Hebrew words into otherwise English speech ("Can you send me the document b'vakasha?") --- improved dramatically across every single model size after fine-tuning. The improvements were staggering: the small model improved by 60%, going from 6.69% to 2.71% WER on code-switched content. Large improved by 62%, dropping from 5.00% to 1.92%. Even tiny and base each improved by 26%.
This makes intuitive sense when you think about it. Pre-trained Whisper models are overwhelmingly trained on monolingual data. They've rarely seen examples of someone switching from English to Hebrew mid-sentence, so they tend to force English interpretations of foreign words, producing garbled nonsense. Even a small amount of code-switched training data teaches the model that this pattern exists and is legitimate. The acoustic adaptation --- learning how a speaker's pronunciation shifts when transitioning between languages --- provides valuable signal that the pre-trained models simply never had.
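Getting these per-category numbers just means tagging each test sample and aggregating errors within a tag. A minimal sketch of that breakdown (the sample tuples and category names here are hypothetical, not the repo's actual data format):

```python
from collections import defaultdict

def word_errors(ref: str, hyp: str) -> int:
    """Word-level Levenshtein distance between two transcripts."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical format: (category, reference transcript, model output).
samples = [
    ("code_switched", "send me the document bevakasha",
                      "send me the document because"),
    ("technical", "fine tune the transformer encoder",
                  "fine tune the transformer encoder"),
]

# Aggregate errors and reference word counts per category, then divide.
errors, words = defaultdict(int), defaultdict(int)
for category, ref, hyp in samples:
    errors[category] += word_errors(ref, hyp)
    words[category] += len(ref.split())

for category in errors:
    print(f"{category}: WER {errors[category] / words[category]:.2%}")
```

Note the aggregation: errors are summed before dividing, rather than averaging per-sample WERs, so long samples aren't underweighted.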
This finding has broad implications. If you're building ASR for bilingual communities --- Spanglish speakers in the US, Hinglish speakers in India, any context where people naturally mix languages --- fine-tuning on even a modest amount of code-switched data could yield massive improvements.
Technical vocabulary: mixed results
The pattern for technical vocabulary mirrored the overall results. Smaller models improved: base went from 6.65% to 5.29% (20% better), and tiny improved by 10%. But medium degraded by a painful 46%, going from 2.02% to 2.95%. Large degraded by 27%. The larger models had already internalized technical AI terminology from their pre-training data, and fine-tuning on a small single-speaker dataset actually made them worse at recognizing those terms.
Practical decision matrix
After running all these evaluations, here's how I'd recommend thinking about fine-tuning:

- Edge or mobile deployment: fine-tune the small variant --- it hits the best accuracy-to-speed tradeoff, with a 13% improvement over the original.
- Maximum accuracy on standard English content: use the original medium model at 2.71% WER and don't fine-tune at all.
- Bilingual or code-switched content: fine-tune aggressively, regardless of model size.
- Limited fine-tuning data (under two hours): stick to the smaller models, where the parameter count is proportional to what you're giving it.
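This decision logic is simple enough to encode as a lookup. A sketch (the function name, size labels, and two-hour threshold are my illustrative choices, not part of the evaluation framework):

```python
def fine_tuning_advice(model_size: str, code_switched: bool,
                       training_minutes: float) -> str:
    """Rough heuristic encoding the decision matrix above."""
    if code_switched:
        # Code-switched content improved at every model size.
        return "fine-tune"
    if model_size in ("medium", "large-v3-turbo"):
        # Larger models degraded on ~90 minutes of general data;
        # they likely need hours of audio to benefit.
        if training_minutes < 120:
            return "use the original model"
        return "fine-tune, but monitor for degradation"
    # tiny/base/small improved even with limited data.
    return "fine-tune"

print(fine_tuning_advice("small", code_switched=False, training_minutes=90))
print(fine_tuning_advice("medium", code_switched=False, training_minutes=90))
```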
Limitations and what's next
I want to be upfront about limitations. This was a proof-of-concept with about 90 minutes of single-speaker training data. The test set was 91 samples. These results would likely shift with more training data, hyperparameter optimization, multi-speaker datasets, or data augmentation. The evaluation ran on a specific hardware setup (AMD Radeon RX 7700 XT with Vulkan), so inference timing results may differ on other hardware. What I'm most interested in exploring next is whether increasing training data to several hours would flip the results for medium and large models, and whether LoRA-based fine-tuning (modifying fewer parameters) might avoid the catastrophic forgetting problem entirely.
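To sketch why LoRA is a plausible fix: instead of updating a full d x d weight matrix, it trains two low-rank factors (d x r and r x d) and leaves the original weights frozen, so far fewer parameters can drift away from the pre-trained solution. Quick arithmetic on a Whisper-scale projection (the d_model value is illustrative):

```python
def lora_trainable_fraction(d_model: int, rank: int) -> float:
    """Fraction of a d x d weight matrix's parameters that LoRA trains:
    two low-rank factors (d x r and r x d) replace updates to the full
    matrix, which stays frozen."""
    full_params = d_model * d_model
    lora_params = 2 * d_model * rank
    return lora_params / full_params

# A large-ish attention projection with a small LoRA rank.
fraction = lora_trainable_fraction(d_model=1280, rank=8)
print(f"LoRA trains {fraction:.2%} of the matrix's parameters")  # 1.25%
```

With only about 1% of each adapted matrix being trainable, the 90-minute dataset has far less room to overwrite what pre-training learned.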
All the evaluation code, fine-tuned models, and datasets are open source. The fine-tuned models are on my HuggingFace profile, and the full evaluation framework is here: Whisper-Fine-Tune-Accuracy-Eval on GitHub.