[P] Fine-tuned Whisper-small for digit-specific transcription (95% accuracy)

**Project:** EchoEntry – Digit-optimized speech recognition API

**Link:** https://echoentry.ai

**Model:** Whisper-small fine-tuned on numeric dataset

**Motivation:**

Generic ASR models struggle with numbers: “105” vs. “15” ambiguity, inconsistent formatting, and poor accuracy on short digit sequences.

**Approach:**

– Base model: Whisper-small (1.7GB)

– Training data: TTS-generated + voice recordings (1-999, 5 accents)

– Task: Forced numeric transcription with digit extraction

– Deployment: FastAPI on 8GB CPU (no GPU needed for inference)

**Results:**

– 95-99% accuracy on 1-3 digit numbers

– Sub-second inference on CPU

– Handles multiple English accents (US, UK, Irish, Australian, Canadian)

**Try it:**

```bash
curl -O https://echoentry.ai/test_audio.wav

curl -X POST https://api.echoentry.ai/v1/transcribe \
  -H "X-Api-Key: demo_key_12345" \
  -F "file=@test_audio.wav;type=audio/wav"
```

**Technical details:**

– Used librosa/FFmpeg for audio preprocessing

– Trim silence (top_db=35) before inference

– Greedy decoding (num_beams=1) for speed

– Forced decoder IDs for English transcription task

**Challenges:**

– Browser audio quality vs native recordings (huge gap)

– Model works great, but web deployment had accuracy issues

– Pivoted to API so devs handle audio capture their way

**Code/model:** Currently closed (exploring validation), but happy to discuss approach.

Docs: https://echoentry.ai/docs.html

submitted by /u/YoungBig676
