MiniCausal-T2V: Towards Ultra-Low Latency and Memory-Efficient Causal Video Generation on Edge Devices

The proliferation of Text-to-Video (T2V) generation technology has opened new avenues for content creation, yet deploying these models on resource-constrained edge devices remains a significant challenge due to their complexity and high computational demands. This paper introduces MiniCausal-T2V (MCT-Video), an end-to-end optimized causal latent video diffusion model engineered for ultra-low-latency, memory-efficient T2V generation on edge platforms, particularly Qualcomm Hexagon NPUs. MCT-Video combines five complementary innovations: a Lightweight Causal Transformer Backbone designed from scratch for intrinsic efficiency and causality; an Adaptive Sparse Temporal Attention mechanism that dynamically reduces temporal computation; Quantization-Aware Fine-tuning for robust low-precision deployment; a Unified Multi-objective Distillation strategy that holistically transfers teacher knowledge; and Extreme Step Flow-Matching Inference for few-step generation. Extensive experiments show that MCT-Video achieves superior video quality on comprehensive VBench metrics and in human evaluation while setting new efficiency benchmarks, with markedly lower end-to-end inference latency and memory footprint on Hexagon NPUs than existing edge-optimized solutions. This work represents a significant step toward enabling high-quality, real-time T2V capabilities directly on portable devices.