Audio multimodal AI: the models that understand sound, not just transcribe it
A curated resource list of multimodal AI models with native audio support — models that process audio tokens, not just transcribe.
Comparing 8 STT models on a 27-minute podcast. Local Whisper wins on word accuracy, but cloud APIs dominate on punctuation.
A short curated list of the best Whisper fine-tuning resources: tutorials, notebooks, and managed compute examples.
Evaluating whether fine-tuning Whisper improves transcription accuracy. Spoiler: it depends on model size and use case.
A script for fine-tuning OpenAI's Whisper speech recognition models using Modal's serverless GPU infrastructure.
A voice-controlled Linux virtual keyboard using Deepgram's Flux turn-taking STT API, built in Rust.
A GUI tool for collecting audio training data for ASR fine-tuning, with LLM-generated prompts and Hugging Face integration.
An MCP server for audio transcription using multimodal LLMs like Gemini, GPT-4o Audio, and Voxtral — not traditional ASR.
An MCP server that brings Gemini-powered audio transcription directly into Claude Code and Claude Desktop.
A desktop transcription app that sends audio directly to multimodal AI models for single-pass transcription and formatting.
A local voice typing app for Linux/Wayland using NVIDIA's Parakeet model. No cloud, no GPU, built-in punctuation.
A snapshot comparing Hebrew TTS quality across six providers, including voice cloning experiments via Replicate.