[P] Fine-tuned Whisper-small for digit-specific transcription (95% accuracy)
**Project:** EchoEntry – Digit-optimized speech recognition API
**Link:** https://echoentry.ai
**Model:** Whisper-small fine-tuned on numeric dataset
**Motivation:**
Generic ASR models struggle with numbers: "105" vs. "15" ambiguity, inconsistent formatting, and poor accuracy on short digit sequences.
**Approach:**
– Base model: Whisper-small (1.7GB)
– Training data: TTS-generated + voice recordings (1-999, 5 accents)
– Task: Forced numeric transcription with digit extraction
– Deployment: FastAPI on 8GB CPU (no GPU needed for inference)
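The TTS side of the training set above can be generated by spelling each target number (1-999) as words before synthesis. A minimal sketch of that step (function and variable names are illustrative; the actual data pipeline isn't published):

```python
# Spell out 1-999 as English words, e.g. to build TTS prompts for the
# numeric training set. Names here are illustrative only.
ONES = ("", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen")
TENS = ("", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety")

def spell(n: int) -> str:
    """Return the spoken English form of an integer in [1, 999]."""
    assert 1 <= n <= 999
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + spell(rest) if rest else "")

# One TTS prompt per target number; 5 accents would multiply this list.
prompts = [spell(n) for n in range(1, 1000)]
```

Each prompt is then synthesized per accent, which keeps label noise near zero since the ground-truth number is known before the audio exists.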
**Results:**
– 95-99% accuracy on 1-3 digit numbers
– Sub-second inference on CPU
– Handles multiple English accents (US, UK, Irish, Australian, Canadian)
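The "digit extraction" step mentioned under Approach can be illustrated with a small word-to-number parser. This is a hedged sketch of the idea, not the actual post-processing code (which isn't published):

```python
import re

# Map number words to values for the 1-999 range the model targets.
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
UNITS.update({w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())})

def extract_number(transcript: str) -> int:
    """Collapse a spoken-number transcript ("one hundred five") to an int.

    Illustrative only; also passes through transcripts the model already
    emitted as digits ("105").
    """
    tokens = re.findall(r"[a-z]+|\d+", transcript.lower().replace("-", " "))
    total = 0
    for tok in tokens:
        if tok.isdigit():
            total += int(tok)
        elif tok == "hundred":
            total = max(total, 1) * 100
        elif tok in UNITS:
            total += UNITS[tok]
        # Filler words like "and" are ignored.
    return total
```

Normalizing both word-form and digit-form outputs to one integer is what makes the "105" vs "15" ambiguity measurable in the first place.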
**Try it:**
```bash
curl -O https://echoentry.ai/test_audio.wav
curl -X POST https://api.echoentry.ai/v1/transcribe \
  -H "X-Api-Key: demo_key_12345" \
  -F "file=@test_audio.wav;type=audio/wav"
```
**Technical details:**
– Used librosa/FFmpeg for audio preprocessing
– Trim silence (top_db=35) before inference
– Greedy decoding (num_beams=1) for speed
– Forced decoder IDs for English transcription task
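The silence trim can be approximated without librosa. Here's a NumPy-only sketch of energy-based trimming at top_db=35 (`librosa.effects.trim` works on the same principle, though its exact framing and defaults differ):

```python
import numpy as np

def trim_silence(audio, top_db=35.0, frame_len=2048, hop=512):
    """Drop leading/trailing frames quieter than `top_db` below the peak.

    A NumPy-only approximation of librosa.effects.trim, for illustration;
    the actual service uses librosa/FFmpeg for preprocessing.
    """
    # Frame-wise RMS energy.
    n_frames = max(1, 1 + (len(audio) - frame_len) // hop)
    rms = np.array([
        np.sqrt(np.mean(audio[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])
    # Convert to dB relative to the loudest frame.
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    loud = np.flatnonzero(db > -top_db)
    if loud.size == 0:
        return audio[:0]
    start = loud[0] * hop
    end = min(len(audio), loud[-1] * hop + frame_len)
    return audio[start:end]
```

Trimming before inference matters on CPU: Whisper pads/chunks to 30-second windows, so shaving dead air off a 2-second digit clip directly cuts per-request latency.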
**Challenges:**
– Browser-captured audio quality falls far short of native recordings (huge gap)
– The model itself works well, but web deployment suffered accuracy drops
– Pivoted to an API so devs handle audio capture their own way
**Code/model:** Currently closed source (exploring validation), but happy to discuss the approach.
submitted by /u/YoungBig676