easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P]
I built easyaligner after working with preprocessing hundreds of thousands of hours of audio and text for training speech-to-text models, and finding that the available open-source forced alignment libraries often lacked certain convenience features. For our purposes, it was particularly important for the tooling to be able to:
The documentation has tutorials for different alignment scenarios and for custom text processing. The aligned outputs can be segmented at any level of granularity (sentence, paragraph, etc.) while preserving the original text's formatting. The forced alignment backend uses PyTorch's forced alignment API with a GPU-based implementation of the Viterbi algorithm. It's both fast and memory-efficient, handling hours of audio/text in one pass without the need to chunk the audio. I've adapted the API to support emission extraction from all wav2vec2 models on the Hugging Face Hub. You can force align audio and text in any language, as long as there's a w2v2 model on the HF Hub that can transcribe the language.
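To make the backend concrete: CTC forced alignment takes per-frame emission log-probabilities from the acoustic model and a target label sequence, and uses Viterbi dynamic programming to find the single best monotonic frame-to-label path. Below is a minimal pure-NumPy sketch of that computation (my own toy illustration of the general algorithm, not easyaligner's or torchaudio's actual implementation, which runs batched on GPU):

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi forced alignment for CTC emissions.

    log_probs: (T, C) array of per-frame log-probabilities.
    targets:   non-empty list of label ids (no blanks).
    Returns the best-scoring label per frame (blanks included).
    """
    T = log_probs.shape[0]
    # CTC expands the target with blanks: [_, t1, _, t2, ..., _]
    ext = [blank]
    for lab in targets:
        ext += [lab, blank]
    S = len(ext)
    dp = np.full((T, S), -np.inf)   # best path score ending at (frame, state)
    bp = np.zeros((T, S), dtype=int)  # backpointers
    dp[0, 0] = log_probs[0, ext[0]]
    dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            best, arg = dp[t - 1, s], s          # stay in the same state
            if s >= 1 and dp[t - 1, s - 1] > best:
                best, arg = dp[t - 1, s - 1], s - 1  # advance one state
            # skip the blank between two *different* labels
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and dp[t - 1, s - 2] > best):
                best, arg = dp[t - 1, s - 2], s - 2
            dp[t, s] = best + log_probs[t, ext[s]]
            bp[t, s] = arg
    # Path may end on the last label or the final blank
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = bp[t, s]
    return path[::-1]

# Toy example: 4 frames, labels {0: blank, 1: 'a', 2: 'b'}
emissions = np.log(np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]))
print(ctc_forced_align(emissions, [1, 2]))  # → [1, 1, 2, 2]
```

Frame indices where the aligned label changes are then mapped back through the model's frame rate (e.g. 20 ms per frame for wav2vec2) to get token-level timestamps; the O(T·S) table is why a GPU implementation pays off on hours-long audio.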
The documentation: https://kb-labb.github.io/easyaligner/

submitted by /u/mLalush