I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]
I’ve been experimenting with real-time pipelines that combine OCR + TTS + voice conversion, and I ended up building a desktop app that can “voice” game subtitles dynamically.
The idea is simple:

- Capture subtitles from the screen (OCR)
- Convert them into speech (TTS)
- Transform the voice per character (RVC)
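The three stages chain together in a straightforward way. Here is a minimal sketch of that flow with toy stand-ins for each stage (all function bodies are hypothetical placeholders, not the app's actual models):

```python
def ocr(frame: str) -> str:
    """Stand-in for screen OCR; here the 'frame' is already text."""
    return frame.strip()

def tts(text: str) -> bytes:
    """Stand-in for text-to-speech synthesis."""
    return f"<audio:{text}>".encode()

def rvc(audio: bytes, character: str) -> bytes:
    """Stand-in for per-character voice conversion."""
    return character.encode() + b"|" + audio

def voice_frame(frame: str, character: str) -> bytes:
    """One pass through the full OCR -> TTS -> RVC chain."""
    return rvc(tts(ocr(frame)), character)
```

In the real app each stub would be a model call, which is what makes the latency and pipelining questions below interesting.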
But the hard parts were:

- Avoiding repeated subtitle spam (similarity filtering)
- Keeping latency low (~0.3 s)
- Handling multiple characters with different voice models without reloading
- Running everything in a smooth pipeline (no audio gaps)
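For the similarity filtering, one simple approach (a sketch, not necessarily what the author used) is to compare each new OCR result against recently spoken lines with a fuzzy ratio, so small OCR jitter on the same subtitle doesn't retrigger speech:

```python
from difflib import SequenceMatcher

def is_duplicate(new_line: str, recent: list[str], threshold: float = 0.9) -> bool:
    """Return True if new_line is near-identical to a recently spoken line.

    OCR tends to re-read the same subtitle with tiny differences
    (punctuation, a flipped character), so exact-match dedup is not enough.
    """
    new_norm = new_line.strip().lower()
    for old in recent:
        if SequenceMatcher(None, new_norm, old.strip().lower()).ratio() >= threshold:
            return True
    return False
```

The threshold is a tunable: too low and distinct short lines get swallowed, too high and OCR noise slips through.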
One thing that helped a lot was using a two-stage pipeline: While one sentence is playing, the next one is already processed in the background.
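That two-stage overlap is essentially a producer/consumer pipeline: a synthesis thread fills a small buffer while the playback loop drains it. A minimal sketch (the `synthesize` and `play` bodies are hypothetical stubs standing in for the real TTS/RVC and audio output):

```python
import queue
import threading
import time

played = []  # record of what was played, in order

def synthesize(text: str) -> bytes:
    """Stand-in for the TTS + RVC stage."""
    time.sleep(0.05)  # pretend inference cost
    return text.encode()

def play(audio: bytes) -> None:
    """Stand-in for audio playback."""
    time.sleep(0.1)  # pretend playback duration
    played.append(audio)

def run_pipeline(sentences: list) -> None:
    buffer = queue.Queue(maxsize=1)  # holds the one pre-synthesized sentence

    def producer():
        for s in sentences:
            buffer.put(synthesize(s))  # stage 1: synthesize ahead
        buffer.put(None)               # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    while (audio := buffer.get()) is not None:
        play(audio)                    # stage 2: play while the next one cooks
```

With `maxsize=1` the producer stays exactly one sentence ahead, which hides synthesis latency without letting the buffer drift far behind the on-screen text.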
I also experimented with:

- Emotion-based voice changes
- Real-time translation (EN → TR)
- Audio ducking (lowering game sound during speech)
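For the ducking, the key detail is smoothing the gain change so the game audio fades rather than clicks. A sketch of one way to do it (a one-pole smoothed gain envelope; parameter names and values are illustrative, not from the original post):

```python
def duck_gain(speech_active: list, duck_to: float = 0.3, smooth: float = 0.8) -> list:
    """Per-frame gain for the game audio.

    While speech is active the gain eases toward duck_to; afterward it
    recovers toward 1.0. The one-pole smoothing avoids audible clicks.
    """
    gains = []
    g = 1.0
    for active in speech_active:
        target = duck_to if active else 1.0
        g = smooth * g + (1 - smooth) * target
        gains.append(g)
    return gains
```

Each output value would then multiply one frame of game audio before mixing in the synthesized voice.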
I’m curious: How would you approach reducing latency further in a multi-model setup like this? Or is there a better alternative to RVC for real-time character voice conversion?
Happy to share more technical details if anyone is interested.
submitted by /u/fqtih0