Real-Time Streaming Text-to-Video Editing with a Diffusion Transformer
The current paradigm of Text-to-Video (T2V) generation struggles to support real-time, interactive applications because existing models are designed for offline, fixed-length video synthesis. This design makes it difficult to maintain long-term temporal consistency and to achieve the low latency required for interactive content creation. We introduce StreamEdit-DiT, a novel framework for real-time streaming text-to-video editing. Our approach substantially modifies the Diffusion Transformer (DiT) architecture: a Multi-Scale Adaptive DiT is augmented with a Progressive Temporal Consistency Module (PTCM) and Dynamic Sparse Attention (DSA) to improve temporal coherence and computational efficiency. The training methodology combines Streaming Coherence Matching (SCM) with an Adaptive Sliding Window (ASW) buffer, complemented by a Hierarchical Progressive Distillation strategy for efficient inference. Evaluated on a custom benchmark, StreamEdit-DiT significantly outperforms existing streaming and consistency methods in prompt adherence, edit fidelity, and overall quality. Crucially, our distilled model achieves high-resolution output, real-time frame rates, and very low latency on a single H100 GPU, validating its practical applicability to interactive video editing.