Fake Voice Detection: A Comparative Analysis of Complex-Valued Deep Learning and Transformer Models across Multiple Languages

The rapid progress of modern text-to-speech (TTS) systems has led to synthetic voices that are increasingly indistinguishable from real human speech, raising serious concerns for security, audio forensics, and biometric authentication. As a result, automatic fake voice detection has become a pressing and challenging research problem. This work addresses the problem of distinguishing synthetically generated voices from real human speech using deep learning techniques. Two state-of-the-art approaches are evaluated. The first is based on complex-valued deep learning and is motivated by the hypothesis that discriminative information between real and synthetic speech is partially embedded in the phase structure of the signal. By representing audio features in the complex domain, this model explicitly captures both magnitude and phase components, enabling the detection of subtle artifacts introduced during synthetic speech generation. The second approach relies on the pretrained Wav2Vec 2.0 transformer model, which learns robust speech representations through large-scale self-supervised training. Training and evaluation are conducted on a multilingual dataset collected from different countries and linguistic contexts: English speech from Ugandan speakers, Spanish speech from Colombian speakers, and Hungarian speech from native Hungarian speakers. Experimental results show that the Wav2Vec 2.0 model achieves F1-scores of 0.90 for English and 0.98 for Spanish, while the complex-valued convolutional neural network obtains an F1-score of 0.83 for Hungarian. These findings highlight the potential of both complex-valued models and foundation speech models to strengthen the security of voice-based systems against synthetic speech in multilingual and cross-domain scenarios.
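
For illustration only, the sketch below shows the complex-domain idea in PyTorch: the STFT is kept in complex form so the phase is preserved, and a complex convolution is assembled from two real-valued convolutions. The layer sizes, the 16 kHz / 512-point STFT settings, and the ComplexConv2d helper are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): complex-valued convolution over a
# complex STFT, so both magnitude and phase reach the network.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution from two real convolutions:
    (a + ib) * (W_r + i W_i) = (a*W_r - b*W_i) + i(a*W_i + b*W_r)."""
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, x):  # x: complex tensor (batch, ch, freq, time)
        a, b = x.real, x.imag
        real = self.conv_r(a) - self.conv_i(b)
        imag = self.conv_i(a) + self.conv_r(b)
        return torch.complex(real, imag)

# Complex STFT features: return_complex=True keeps the phase instead of
# collapsing the signal to a magnitude-only spectrogram.
waveform = torch.randn(1, 16000)                 # dummy 1 s clip at 16 kHz
spec = torch.stft(waveform, n_fft=512, hop_length=160,
                  window=torch.hann_window(512), return_complex=True)
spec = spec.unsqueeze(1)                          # (batch, 1, freq, time), complex64

feat = ComplexConv2d(1, 16, kernel_size=3)(spec)  # complex feature maps
logits = nn.Linear(16, 2)(feat.abs().mean(dim=(2, 3)))  # real vs. fake scores
print(logits.shape)                               # torch.Size([1, 2])
```

A magnitude spectrogram would discard the phase term entirely; operating on the complex tensor is what lets such a model respond to the phase artifacts the abstract refers to.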
