From Patient Emotion Recognition to Provider Understanding: A Multimodal Data Mining Framework for Clinical Counseling Communication Analysis
The computational analysis of therapeutic communication presents fundamental challenges in multi-label classification, severe class imbalance, and heterogeneous multimodal data integration. We introduce a bidirectional framework that addresses both patient emotion recognition and provider behavior analysis through data mining techniques. For patient-side emotion recognition, we fine-tune ClinicalBERT on human-annotated CounselChat data comprising 1,482 counseling interactions across 25 emotion categories, with class imbalance ratios reaching 60:1. By combining frequency-stratified class weighting with dynamic per-class threshold optimization, we achieve a macro-F1 of 0.74, a six-fold improvement over baseline multi-label approaches. Because patient emotion detection alone provides insufficient analytic utility, we extend the framework to provider-side behavior recognition using real-world psychotherapy sessions. We process 330 YouTube therapy sessions through an automated pipeline incorporating speaker diarization, automatic speech recognition, and temporal segmentation, yielding 14,086 annotated 10-second communication segments. Our provider-side architecture combines DeBERTa-v3-base for contextual text encoding with WavLM-base-plus for self-supervised audio representation learning, integrated through cross-modal attention mechanisms that learn content-dependent prosodic associations. On controlled human-annotated HOPE data comprising 178 sessions with approximately 12,500 utterances, the provider model achieves a macro-F1 of 0.91 and a Cohen’s kappa of 0.87, outperforming simple concatenation-based fusion by 12 percentage points and reaching agreement comparable to the inter-rater reliability reported among trained human annotators in psychotherapy process research. On automatically annotated YouTube data, the model achieves a macro-F1 of 0.71, demonstrating the feasibility of analyzing naturalistic clinical communication at scale while highlighting the performance gap between controlled and real-world scenarios.
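As an illustration of the class weighting and per-class threshold optimization mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes inverse-frequency positive weights and a simple grid search over thresholds on held-out validation predictions; the function names, the grid, and the toy data are all hypothetical.

```python
import numpy as np

def inverse_frequency_weights(y_true):
    """Per-class positive weight = #negatives / #positives (an assumed weighting scheme)."""
    pos = y_true.sum(axis=0)
    neg = y_true.shape[0] - pos
    return neg / np.maximum(pos, 1)  # guard against classes with no positives

def f1_binary(y, pred):
    """F1 for one class; defined as 0.0 when there are no positives and no predictions."""
    tp = np.sum((y == 1) & (pred == 1))
    fp = np.sum((y == 0) & (pred == 1))
    fn = np.sum((y == 1) & (pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def tune_thresholds(y_true, probs, grid=None):
    """Pick, for each class independently, the grid threshold that maximizes that class's F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # assumed search grid
    thresholds = np.full(y_true.shape[1], 0.5)
    for c in range(y_true.shape[1]):
        scores = [f1_binary(y_true[:, c], (probs[:, c] >= t).astype(int)) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds

def macro_f1(y_true, probs, thresholds):
    """Unweighted mean of per-class F1 after applying per-class thresholds."""
    preds = (probs >= thresholds).astype(int)
    return float(np.mean([f1_binary(y_true[:, c], preds[:, c]) for c in range(y_true.shape[1])]))

# Toy validation data (hypothetical): class 0 is rare and scored with low confidence.
y_val = np.array([[1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0]])
p_val = np.array([[0.30, 0.10], [0.10, 0.20], [0.20, 0.90],
                  [0.05, 0.10], [0.10, 0.30], [0.15, 0.05]])

weights = inverse_frequency_weights(y_val)
fixed = macro_f1(y_val, p_val, np.array([0.5, 0.5]))
tuned = macro_f1(y_val, p_val, tune_thresholds(y_val, p_val))
```

In this toy case a fixed 0.5 cutoff misses the rare class entirely, while a lower per-class threshold recovers it. In practice the thresholds would be selected on validation data and then applied unchanged to the test set, so that the reported macro-F1 is not inflated by tuning on the evaluation split.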