Contrastive Representation Learning for Voice-Based Autistic Trait Identification

Early identification of Autism Spectrum Disorder (ASD) traits in infants is crucial for early intervention, which can greatly improve a child's quality of life. Voice-analysis solutions offer a promising, non-invasive route to detecting ASD. However, most existing studies rely on extracting specific voice markers from particular datasets and lack validation across cohorts. In this paper, we propose a supervised contrastive learning method for identifying ASD from infant vocalizations. We extend the self-supervised Time-Frequency Consistency (TF-C) framework into a contrastive approach that uses labels. Our method exploits both time-domain and frequency-domain information through a dual-branch encoder, applying supervised contrastive constraints during pre-training to reduce intra-class variation while increasing inter-class separation in the embedding space. We pre-train the model with diagnostic labels on an open-access dataset comprising typically developing (TD), Attention-Deficit Hyperactivity Disorder (ADHD), and ASD infants, then fine-tune it with a simple classification head. Evaluation on a cross-cohort set of participants shows that the model generalizes well and can distinguish ASD from non-ASD infants, achieving up to 100.00% accuracy on non-verbal vocalizations.
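The abstract's "supervised contrastive constraints" presumably correspond to a supervised contrastive (SupCon-style) objective, in which all same-label samples in a batch act as positives for an anchor. The paper does not give its loss formula, so the following is a minimal NumPy sketch of that standard objective applied to a batch of embeddings (e.g. outputs of one branch of the dual-branch encoder); the function name, temperature value, and inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss sketch: for each anchor, positives are
    all other batch samples with the same label (assumed setup, not the
    paper's exact formulation)."""
    # L2-normalize embeddings so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    logits_mask = ~np.eye(n, dtype=bool)               # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask
    # numerically stable log-softmax over all non-self pairs
    sim_max = np.max(np.where(logits_mask, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    # average log-probability over each anchor's positives
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0                             # anchors with >=1 positive
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()

labels = np.array([0, 0, 1, 1])
# embeddings clustered by class yield a lower loss than mixed ones
tight = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
mixed = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
```

Minimizing this loss pulls same-class (e.g. ASD) embeddings together and pushes different-class embeddings apart, which is the intra-class/inter-class effect the abstract describes; in the paper's setting it would be applied jointly with the TF-C time- and frequency-branch consistency terms during pre-training.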
