smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3
Audio AI has had a breakout year. Automatic speech recognition has gotten dramatically better with models like OpenAI’s Whisper variants, NVIDIA’s Parakeet, and Mistral’s Voxtral. Audio understanding stepped forward with models like NVIDIA’s Audio Flamingo 3. Dialogue-grade text-to-speech arrived via Nari Labs’ Dia-1.6B. And Meta shipped the Perception Encoder Audiovisual (PE-AV), a multimodal encoder capable of learning a shared embedding space across audio, video, and text. The frontier has never moved faster. The catch? The practical knowledge required […]