VideoStylist: Text-to-Consistent Video Stylization with Temporal Anchor Tokens

Extending Text-to-Image (T2I) generation to Text-Guided Video Stylization (T2GVS) presents significant challenges in temporal consistency, style fidelity, and fine-grained control; naive frame-by-frame application of T2I models results in severe flickering. We propose VideoStylist, a novel diffusion model that extends a pre-trained T2I U-Net to a four-dimensional architecture for high-quality stylized video generation. Our key innovations are Temporal Anchor Tokens (TATs), which globally anchor style semantics across frames to mitigate flickering, and an Adaptive Spatio-Temporal Consistency Module (ASTCM), which enhances local coherence and smooths transitions via dynamic spatio-temporal attention. We construct a diverse video-text dataset via a dual strategy: generating descriptions with LLMs and extending T2I datasets with weak labels. Extensive experiments show that VideoStylist significantly outperforms state-of-the-art baselines in style fidelity, temporal consistency, and perceptual quality, and earns strong user preference. Ablation studies confirm the critical contributions of TATs and ASTCM. VideoStylist advances T2GVS, delivering stable, high-fidelity, and visually appealing stylized video content.
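To make the TAT idea concrete, below is a minimal PyTorch sketch of one plausible reading of the mechanism: a small set of learnable tokens, shared identically by every frame, is injected as extra keys/values into per-pixel temporal self-attention, so each frame reads style semantics from one frame-independent source. The module name, tensor shapes, anchor count, and the prepend-as-keys/values design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Temporal Anchor Tokens (TATs); all names and
# design choices here are assumptions for illustration, not the
# authors' code.
import torch
import torch.nn as nn

class TemporalAnchorAttention(nn.Module):
    def __init__(self, dim: int, num_anchors: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable tokens shared by every frame: each spatial location
        # attends over its own temporal sequence *plus* these anchors,
        # so style semantics are anchored to one global, frame-independent
        # source rather than drifting frame to frame.
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) features from the temporally inflated T2I U-Net.
        b, t, h, w, c = x.shape
        # Flatten space so each pixel carries its own temporal sequence.
        seq = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
        seq_n = self.norm(seq)
        # Prepend the shared anchors to keys/values only; queries stay
        # per-frame, so every frame attends to the same global anchors.
        anchors = self.anchors.unsqueeze(0).expand(b * h * w, -1, -1)
        kv = torch.cat([anchors, seq_n], dim=1)
        out, _ = self.attn(seq_n, kv, kv)
        out = seq + out  # residual connection
        return out.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
```

Because the anchor tokens are identical across frames, the attended style signal cannot flicker with per-frame noise, which matches the abstract's stated role for TATs; ASTCM would then refine local, frame-to-frame coherence on top of this global anchoring.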
