[D] How should I fine-tune an ASR model for multilingual IPA transcription?

Hi everyone!

I’m working on a project where I want to build an ASR system that transcribes audio into IPA based on what was actually said (the realized pronunciation, not a canonical dictionary form). The dataset is multilingual.

Here’s what I currently have:

– 36 audio files with clear pronunciation + IPA transcriptions

– 100 audio files from random speakers with background noise + IPA annotations

My goal is to train an ASR model that can take new audio and output IPA transcription.
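In case it helps frame the question: if the eventual setup is a CTC-style model (e.g. Wav2Vec2ForCTC from Hugging Face transformers), one concrete prerequisite is a vocabulary mapping each IPA symbol in the annotations to an integer id. Here is a minimal sketch of that step; the helper name and the sample transcriptions are placeholders, not part of my actual data.

```python
# Sketch: build a character-level vocabulary of IPA symbols from the
# IPA annotations, as a CTC head needs one. Assumes one symbol per
# Unicode character (combining diacritics would need extra handling).

def build_ipa_vocab(transcriptions):
    """Map each distinct IPA symbol to an integer id, reserving
    0 for a PAD/blank token and 1 for the word delimiter."""
    symbols = sorted({ch for t in transcriptions for ch in t if ch != " "})
    vocab = {"[PAD]": 0, "|": 1}  # "|" stands in for the space character
    for i, ch in enumerate(symbols, start=len(vocab)):
        vocab[ch] = i
    return vocab

# Placeholder transcriptions, just to show the shape of the output:
vocab = build_ipa_vocab(["həˈloʊ wɝld", "ˈwɔːtər"])
print(vocab)
```

The resulting dict could then be saved as `vocab.json` and handed to a tokenizer when setting up fine-tuning.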

I’d love advice on two main things:

  1. What model should I start with?

  2. How should I fine-tune it?

Thank you.

submitted by /u/Routine-Ticket-5208