Multimodal Consistency-Guided Reference-Free Data Selection for ASR Accent Adaptation

arXiv:2602.13263v1 Announce Type: new
Abstract: Automatic speech recognition (ASR) systems often degrade on accented speech because acoustic-phonetic and prosodic shifts induce a mismatch with the training data, and labeled accent adaptation is costly. Pseudo-labeling offers a label-free alternative, but common pseudo-label selection heuristics are largely text-centric (e.g., perplexity (PPL) filtering) and can prefer fluent yet acoustically mismatched hypotheses, leading to error amplification during fine-tuning. To address this, we introduce a multimodal consistency-guided, reference-free data selection pipeline for ASR accent adaptation under a transductive, label-free protocol. The pipeline starts with a target-aware preselection step based on submodular mutual information to improve query relevance and reduce downstream computation. It then generates multiple pseudo-transcriptions per utterance via perturbation-based decoding and scores each hypothesis using two reference-free signals: speech–text alignment in a shared embedding space and predicted word error rate (WER). A simple percentile-based selection rule retains reliable pseudo-labels for fine-tuning and discards noisy utterances. In an in-domain setting, selecting ~1.5k utterances from a 30k-utterance pool achieves 10.91% WER, close to the 10.45% obtained with 30k supervised labels. In a cross-domain setting with a mismatched candidate pool, consistency-filtered subsets avoid the degradation caused by unfiltered pseudo-labels under strong accent shift, and matched-hour experiments on a stronger ASR backbone further confirm gains over random sampling and recent selection baselines.
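To make the scoring and percentile-based selection concrete, here is a minimal Python sketch of the reference-free filtering step as described in the abstract; it is not the authors' code, and the embedding, WER-prediction, and decoding functions (embed_speech, embed_text, predict_wer) are hypothetical placeholders supplied by the caller.

```python
# Illustrative sketch of consistency-guided, reference-free pseudo-label selection:
# each utterance has several perturbation-decoded hypotheses; each hypothesis is
# scored by (i) speech-text alignment in a shared embedding space and (ii) a
# predicted WER; utterances whose best hypothesis clears percentile thresholds
# are kept for fine-tuning. Placeholder function names are assumptions.

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def score_utterance(audio, hypotheses, embed_speech, embed_text, predict_wer):
    """Pick the best hypothesis for one utterance using the two reference-free signals."""
    speech_emb = embed_speech(audio)
    scored = []
    for hyp in hypotheses:
        align = cosine(speech_emb, embed_text(hyp))  # speech-text alignment score
        pwer = predict_wer(audio, hyp)               # reference-free WER estimate
        scored.append({"hyp": hyp, "align": align, "pwer": pwer})
    # Prefer high alignment and low predicted WER.
    return max(scored, key=lambda r: (r["align"], -r["pwer"]))


def percentile_select(records, align_pct=50.0, wer_pct=50.0):
    """Keep utterances whose best hypothesis is above the alignment percentile
    and below the predicted-WER percentile (thresholds are illustrative)."""
    aligns = np.array([r["align"] for r in records])
    pwers = np.array([r["pwer"] for r in records])
    a_thr = np.percentile(aligns, align_pct)
    w_thr = np.percentile(pwers, wer_pct)
    return [r for r in records if r["align"] >= a_thr and r["pwer"] <= w_thr]
```

In this sketch, the retained records (utterance plus its best-scoring hypothesis) would form the pseudo-labeled subset used for fine-tuning, while the remaining utterances are discarded as noisy.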
