I built a real-time pipeline that reads game subtitles and converts them into dynamic voice acting (OCR → TTS → RVC) [P]
I’ve been experimenting with real-time pipelines that combine OCR + TTS + voice conversion, and I ended up building a desktop app that can “voice” game subtitles dynamically.
The idea is simple:

- Capture subtitles from the screen (OCR)
- Convert them into speech (TTS)
- Transform the voice per character (RVC)
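The three stages chain together in a straightforward way. Here is a minimal sketch of that flow with toy stand-ins for each stage (all function bodies are hypothetical placeholders, not the app's actual models):

```python
def ocr(frame: str) -> str:
    """Stand-in for screen OCR; here the 'frame' is already text."""
    return frame.strip()

def tts(text: str) -> bytes:
    """Stand-in for text-to-speech synthesis."""
    return f"<audio:{text}>".encode()

def rvc(audio: bytes, character: str) -> bytes:
    """Stand-in for per-character voice conversion."""
    return character.encode() + b"|" + audio

def voice_frame(frame: str, character: str) -> bytes:
    """One pass through the full OCR -> TTS -> RVC chain."""
    return rvc(tts(ocr(frame)), character)
```

In the real app each stub would be a model call, which is what makes the latency and pipelining questions below interesting.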
But the hard parts were:

- Avoiding repeated subtitle spam (similarity filtering)
- Keeping latency low (~0.3 s)
- Handling multiple characters with different voice models without reloading
- Running everything in a smooth pipeline (no audio gaps)
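For the similarity filtering, one simple approach (a sketch, not necessarily what the author used) is to compare each new OCR result against recently spoken lines with a fuzzy ratio, so small OCR jitter on the same subtitle doesn't retrigger speech:

```python
from difflib import SequenceMatcher

def is_duplicate(new_line: str, recent: list[str], threshold: float = 0.9) -> bool:
    """Return True if new_line is near-identical to a recently spoken line.

    OCR tends to re-read the same subtitle with tiny differences
    (punctuation, a flipped character), so exact-match dedup is not enough.
    """
    new_norm = new_line.strip().lower()
    for old in recent:
        if SequenceMatcher(None, new_norm, old.strip().lower()).ratio() >= threshold:
            return True
    return False
```

The threshold is a tunable: too low and distinct short lines get swallowed, too high and OCR noise slips through.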
One thing that helped a lot was using a two-stage pipeline: While one sentence is playing, the next one is already processed in the background.
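That two-stage overlap is essentially a producer/consumer pipeline: a synthesis thread fills a small buffer while the playback loop drains it. A minimal sketch (the `synthesize` and `play` bodies are hypothetical stubs standing in for the real TTS/RVC and audio output):

```python
import queue
import threading
import time

played = []  # record of what was played, in order

def synthesize(text: str) -> bytes:
    """Stand-in for the TTS + RVC stage."""
    time.sleep(0.05)  # pretend inference cost
    return text.encode()

def play(audio: bytes) -> None:
    """Stand-in for audio playback."""
    time.sleep(0.1)  # pretend playback duration
    played.append(audio)

def run_pipeline(sentences: list) -> None:
    buffer = queue.Queue(maxsize=1)  # holds the one pre-synthesized sentence

    def producer():
        for s in sentences:
            buffer.put(synthesize(s))  # stage 1: synthesize ahead
        buffer.put(None)               # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    while (audio := buffer.get()) is not None:
        play(audio)                    # stage 2: play while the next one cooks
```

With `maxsize=1` the producer stays exactly one sentence ahead, which hides synthesis latency without letting the buffer drift far behind the on-screen text.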
I also experimented with:

- Emotion-based voice changes
- Real-time translation (EN → TR)
- Audio ducking (lowering game sound during speech)
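For the ducking, the key detail is smoothing the gain change so the game audio fades rather than clicks. A sketch of one way to do it (a one-pole smoothed gain envelope; parameter names and values are illustrative, not from the original post):

```python
def duck_gain(speech_active: list, duck_to: float = 0.3, smooth: float = 0.8) -> list:
    """Per-frame gain for the game audio.

    While speech is active the gain eases toward duck_to; afterward it
    recovers toward 1.0. The one-pole smoothing avoids audible clicks.
    """
    gains = []
    g = 1.0
    for active in speech_active:
        target = duck_to if active else 1.0
        g = smooth * g + (1 - smooth) * target
        gains.append(g)
    return gains
```

Each output value would then multiply one frame of game audio before mixing in the synthesized voice.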
I’m curious: How would you approach reducing latency further in a multi-model setup like this? Or is there a better alternative to RVC for real-time character voice conversion?
Happy to share more technical details if anyone is interested.
submitted by /u/fqtih0