A Clinically Guided Rule-Based Synthetic Dataset for Multi-Modal Longitudinal Treatment-Response Monitoring in Major Depressive Disorder
Monitoring treatment response in Major Depressive Disorder (MDD) remains challenging because treatment selection often follows a trial-and-error approach and access to real-world multimodal mental health data is limited by privacy, ethical, and availability constraints. This study presents a methodological approach for designing and generating a clinically guided, rule-based synthetic multimodal dataset to support early-stage experimentation in MDD treatment-response monitoring. Digital biomarkers relevant to depression were identified through a literature review and expert consultation. Patient Health Questionnaire-9 (PHQ-9) scores served as the primary clinical anchor, while simulated smartphone and wearable indicators were organized into composite domains: sleep, activity, mobility, physiology, social interaction, digital behavior, adherence, ecological momentary assessment, and missingness. The resulting synthetic data schema guided the generation of a 12-week acute-phase dataset comprising baseline characteristics, daily monitoring variables, biweekly PHQ-9 assessments, treatment review points, and derived clinical labels (response, remission, and trajectory group). The generated dataset demonstrated plausibility at the statistical, distributional, temporal, dependency, and trajectory levels. This work contributes a transparent and reproducible framework for synthetic data generation in privacy-sensitive mental health research and provides a controlled testbed for future machine learning and federated learning experiments.
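To make the notion of derived clinical labels concrete, the following minimal Python sketch shows how response and remission labels might be computed from biweekly PHQ-9 scores over a 12-week window. The thresholds are assumptions based on common PHQ-9 conventions (response as a reduction of at least 50% from baseline, remission as a final score below 5) and are not confirmed as the rules used in this study; the function name `derive_labels` and the week-indexed dictionary layout are hypothetical.

```python
from typing import Dict

# Assumed thresholds (common PHQ-9 conventions, not taken from the study itself).
RESPONSE_REDUCTION = 0.50  # response: at least a 50% drop from baseline
REMISSION_CUTOFF = 5       # remission: final PHQ-9 score below 5

BASELINE_WEEK = 0
FINAL_WEEK = 12            # end of the 12-week acute phase


def derive_labels(phq9_by_week: Dict[int, int]) -> Dict[str, bool]:
    """Derive response and remission labels from week-indexed PHQ-9 totals.

    `phq9_by_week` maps an assessment week (0, 2, ..., 12) to a PHQ-9 total.
    Only the baseline and final assessments are used in this sketch.
    """
    for week, score in phq9_by_week.items():
        # PHQ-9 totals range from 0 to 27 (nine items scored 0 to 3).
        if not 0 <= score <= 27:
            raise ValueError(f"PHQ-9 score out of range at week {week}: {score}")

    baseline = phq9_by_week[BASELINE_WEEK]
    final = phq9_by_week[FINAL_WEEK]

    # Relative reduction from baseline; defined as 0 when baseline is 0.
    reduction = (baseline - final) / baseline if baseline > 0 else 0.0

    return {
        "response": reduction >= RESPONSE_REDUCTION,
        "remission": final < REMISSION_CUTOFF,
    }


if __name__ == "__main__":
    # Hypothetical synthetic patient with biweekly assessments.
    scores = {0: 20, 2: 18, 4: 15, 6: 12, 8: 10, 10: 9, 12: 8}
    print(derive_labels(scores))
```

In this sketch a patient whose score falls from 20 to 8 is labeled a responder but not a remitter, since the final score remains at or above the assumed remission cutoff. Trajectory-group labels would need additional rules over the intermediate assessments, which are not specified here.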