Multimodal Machine Learning in Healthcare: A Tutorial and Review
Deep learning has transformed healthcare by enabling the analysis of complex, high-dimensional, and heterogeneous data. However, traditional unimodal approaches often fail to capture the multifaceted nature of human health, as patient information is inherently distributed across multiple data modalities. Multimodal machine learning (MML) has therefore emerged as a framework for integrating complementary sources such as medical images, clinical text, electronic health records (EHRs), and physiological signals to support more comprehensive modeling of health and disease. This narrative review provides a structured overview of MML in healthcare, focusing on representative data modalities, fusion strategies, advanced architectures, and clinically relevant design trade-offs. In particular, we distinguish between stage-based fusion strategies, which determine when modalities are combined, and feature integration mechanisms, which define how modality representations are merged. We synthesize applications across major domains, including brain disorders, cancer prediction, chest-related conditions, and skin diseases, highlighting both the potential benefits and the limitations of multimodal approaches. We further discuss key challenges related to data heterogeneity, cross-modal alignment, missing modalities, and the complexity of effective fusion design, along with broader issues in clinical translation. Finally, we outline future directions centered on foundation models, causal reasoning, privacy-preserving learning, and integrated healthcare data infrastructures.