End-to-End ASR Conformers: Revolutionizing Hearing-to-Speech-to-Writing Language Processing Frameworks
This paper introduces a novel end-to-end framework that uses Conformer architectures to unify the traditionally fragmented pipeline of hearing-to-speech-to-writing language processing. Unlike conventional automatic speech recognition (ASR) systems, which chain separate acoustic, phonetic, and linguistic models and therefore propagate errors from one stage to the next, our approach employs stacked Conformer encoders. These convolution-augmented Transformers capture both local spectral detail and long-range contextual dependencies in the audio input. The model maps mel-spectrograms directly to intermediate speech representations and final textual outputs via a […]
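
As a rough illustration of how a convolution-augmented Transformer layer combines local and long-range context, the sketch below chains a self-attention step and a depthwise convolution over time, each with a residual connection. It is a toy in pure Python, not the model proposed here: the function names (`self_attention`, `depthwise_conv`, `toy_conformer_block`), the smoothing kernel, and the input values are all hypothetical. It also omits parts that a full Conformer block normally includes, such as the feed-forward modules, layer normalization, learned projections, and multiple attention heads.

```python
import math


def self_attention(frames):
    """Global context: single-head dot-product attention with identity projections."""
    dim = len(frames[0])
    out = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ])
    return out


def depthwise_conv(frames, kernel):
    """Local context: per-channel 1-D convolution over time with zero padding.

    Assumes an odd-length kernel (hypothetical choice) so each output frame
    is centred on its input frame.
    """
    half = len(kernel) // 2
    steps, dim = len(frames), len(frames[0])
    out = []
    for t in range(steps):
        row = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < steps:  # frames outside the sequence count as zeros
                for d in range(dim):
                    row[d] += weight * frames[src][d]
        out.append(row)
    return out


def residual(x, y):
    """Element-wise sum of two equally shaped frame sequences."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def toy_conformer_block(frames, kernel=(0.25, 0.5, 0.25)):
    """Attention (long-range) followed by convolution (local), both residual."""
    x = residual(frames, self_attention(frames))
    x = residual(x, depthwise_conv(x, kernel))
    return x


# Three time steps of a two-channel feature sequence (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = toy_conformer_block(frames)
print(len(encoded), len(encoded[0]))  # 3 2
```

The block preserves the sequence length and feature dimension, which is what allows such layers to be stacked into a deeper encoder.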