Synergistic Multimodal Diffusion Transformer: Unifying and Enhancing Multimodal Generation via Adaptive Discrete Diffusion
Current multimodal artificial intelligence remains fragmented: models are typically optimized for a single task, which prevents efficient and uniform handling of diverse tasks such as Text-to-Image generation (T2I), Image-to-Text generation (I2T), and Visual Question Answering (VQA) within a single framework. To address this, we propose the Synergistic Multimodal Diffusion Transformer (SyMDit), a novel unified discrete diffusion model. SyMDit integrates an Adaptive Cross-Modal Transformer (ACMT) with a Synergistic Attention Module (SAM) for dynamic cross-modal interaction, Hierarchical Semantic Visual Tokenization (HSVT) for multi-scale visual understanding, and Context-Aware Text Embedding with special tokens for nuanced textual representation. SyMDit is trained under a unified discrete diffusion paradigm using a multi-stage strategy that combines advanced data augmentation and selective masking. Extensive evaluations show that SyMDit consistently outperforms existing baselines across T2I, I2T, and VQA tasks. Moreover, SyMDit substantially improves inference efficiency, delivering clear speedups over autoregressive and prior discrete diffusion methods. This work is a step toward truly unified and efficient multimodal AI, providing a robust framework for general-purpose multimodal intelligence.
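
To make the training paradigm concrete, the following is a minimal PyTorch-style sketch of one masked (absorbing-state) discrete diffusion training step with selective, modality-aware masking, in the spirit of the objective summarized above. All names and details here (the MASK_ID constant, the modality_mask argument, the per-sequence noise level, and the model interface) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 8191  # hypothetical id of the [MASK] token in the joint text+visual vocabulary


def diffusion_training_step(model, tokens, modality_mask, optimizer):
    """One masked discrete-diffusion training step with selective masking.

    tokens:        (B, L) long tensor of joint text and visual token ids
    modality_mask: (B, L) bool tensor, True where corruption is allowed
                   (e.g. mask only the target modality for a given task)
    """
    B, L = tokens.shape
    # Sample a per-sequence noise level (masking rate) in [0, 1).
    t = torch.rand(B, 1, device=tokens.device)
    # Corrupt a random subset of the allowed positions at this noise level.
    corrupt = (torch.rand(B, L, device=tokens.device) < t) & modality_mask
    if not corrupt.any():  # degenerate draw: nothing was masked, skip the step
        return 0.0
    noisy = tokens.masked_fill(corrupt, MASK_ID)

    # The model predicts the original token id at every position.
    logits = model(noisy)  # (B, L, vocab_size)
    # The loss is computed only on the corrupted positions.
    loss = F.cross_entropy(logits[corrupt], tokens[corrupt])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because a model trained this way recovers all masked tokens in parallel rather than one token at a time, inference can iteratively unmask many positions per step, which is the general mechanism behind the speedups over autoregressive decoding claimed above.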