Adaptive-Guided Latent Diffusion for Video Counterfactual Explanations with Multi-Scale Perceptual Refinement

The increasing reliance on deep learning models for video understanding necessitates transparent and interpretable decision-making, and video counterfactual explanations (CEs) offer a critical avenue for understanding model behavior. However, generating effective video CEs remains challenging due to the high dimensionality of video data, the demand for temporal coherence, and the need to balance precise alterations with visual realism. Existing Latent Diffusion Model (LDM)-based CE methods often struggle with generation efficiency, exact target adherence, and the suppression of subtle visual artifacts. To address these limitations, we propose an Adaptive-Guided Latent Diffusion Model for Counterfactual Explanations (AG-LDM-CE). Our framework introduces an Adaptive Gradient Guidance (AGG) mechanism that dynamically adjusts guidance strength based on proximity to the target prediction, improving efficiency while balancing target adherence with visual fidelity. Complementing this, a novel Multi-Scale Perceptual Refinement (MSPR) module leverages multi-level VAE features to suppress artifacts and ensure that counterfactual changes are localized to causally relevant regions. Extensive evaluations across diverse video regression (EchoNet-Dynamic) and classification (FERV39K, Something-Something V2) tasks demonstrate the superior performance of AG-LDM-CE: our method substantially improves generation efficiency and explanation quality, achieving strong target adherence and perceptual realism. Ablation studies and human evaluations further validate the contributions of AGG and MSPR, confirming that AG-LDM-CE generates more efficient, accurate, perceptually realistic, and precisely localized video counterfactual explanations.
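
To make the adaptive-guidance idea concrete, the following is a minimal sketch of how a proximity-dependent guidance weight could be applied during one reverse-diffusion step. This is an illustrative interpretation rather than the paper's implementation: the module names (`denoiser`, `vae`, `regressor`), the tanh-shaped scaling, and the squared-error target loss are all assumptions.

```python
# Illustrative sketch only: guidance strength scaled by distance to the target
# prediction, in the spirit of adaptive gradient guidance. All module names and
# the specific scaling rule are hypothetical, not the AG-LDM-CE formulation.
import torch

def adaptive_guidance_step(z_t, denoiser, vae, regressor, y_target, t,
                           guidance_scale_max=2.0):
    """One guided reverse-diffusion step on a noisy latent z_t."""
    z_t = z_t.detach().requires_grad_(True)

    # Estimate the clean latent, decode it, and score it with the model to explain.
    z0_hat = denoiser(z_t, t)        # denoised latent estimate
    x_hat = vae.decode(z0_hat)       # decoded video frames
    y_hat = regressor(x_hat)         # prediction on the current counterfactual

    # Proximity-dependent guidance weight: strong when the prediction is far
    # from the target, tapering toward zero as it gets close.
    distance = (y_hat - y_target).abs().mean()
    guidance_scale = guidance_scale_max * torch.tanh(distance.detach())

    # Gradient of the target-adherence loss with respect to the noisy latent.
    loss = (y_hat - y_target).pow(2).mean()
    grad = torch.autograd.grad(loss, z_t)[0]

    # Nudge the latent against the gradient with adaptive strength.
    return (z_t - guidance_scale * grad).detach()
```

In this sketch the guidance step naturally weakens as the counterfactual's prediction approaches the target, which is one way to trade off target adherence against visual fidelity late in sampling.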
