Multimodal Datasets with Controllable Mutual Information
arXiv:2510.21686v2 Announce Type: replace Abstract: We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information (MI) between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning (SSL) techniques. Our framework constructs realistic datasets with known MI using a flow-based generative model and a structured causal framework for generating correlated latent variables. We benchmark a suite of MI estimators on […]