Structured Modeling and Representation Methods for Post-Retrieval Inference Processes in Large Video Language Models
Existing Video-RAG systems often concatenate retrieved segments directly into the input, which leads to reasoning drift when hard negative samples are introduced. This paper proposes a Structured Post-Retrieval Reasoning (SPRR) module for Large Video Language Models (LVLMs) that explicitly models the post-retrieval process as three stages: (1) Evidence Validation: generates 3–8 “decidable” sub-problems for the top-k = 20 candidate clips, outputs binary/numeric scores, and filters the candidates down to k′ = 4–6 clips; (2) Conflict Resolution: establishes consistency constraints (e.g., temporal order, entity attribute invariance) over contradictory information across multiple clips and selects the minimum-conflict subset to form a coherent evidence pool; (3) Temporal Aggregation: serializes the evidence, indexed by event timestamps, into interpretable reasoning chains that include the referenced clip IDs and temporal ranges. We evaluate SPRR on MLVU (3,102 QA) and LongVideoBench (6,678 MCQ), using open-ended and multiple-choice formats respectively, and measure interpretability metrics (average evidence count, conflict rate, reasoning chain length) and efficiency metrics (input tokens/reasoning steps). These evaluations validate SPRR’s benefits in “reducing noise, enhancing interpretability, and improving stability.”
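
The three stages can be illustrated with a minimal Python sketch. It is not the SPRR implementation: the `Clip` fields, the function names, and the example data are hypothetical. It assumes that the sub-problem answers are already available as binary values (in practice they would come from the LVLM), that a clip's validation score is the fraction of positive answers, that two clips conflict when they assert different values for the same entity attribute, and that the "minimum-conflict subset" can be approximated by the largest conflict-free subset, found by brute force (feasible because k′ is at most 6).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Clip:
    clip_id: str
    start: float        # seconds
    end: float          # seconds
    answers: list       # binary answers to the sub-problems (hypothetical input)
    attributes: dict    # entity attribute -> asserted value (hypothetical input)

    @property
    def score(self) -> float:
        # Assumed scoring rule: fraction of sub-problems answered positively.
        return sum(self.answers) / len(self.answers)


def validate(clips, k_prime=4):
    """Stage 1: keep the k_prime clips with the highest validation score."""
    return sorted(clips, key=lambda c: c.score, reverse=True)[:k_prime]


def conflicts(a, b):
    """Two clips conflict if they disagree on any attribute they both assert."""
    shared = a.attributes.keys() & b.attributes.keys()
    return any(a.attributes[k] != b.attributes[k] for k in shared)


def resolve(clips):
    """Stage 2: largest conflict-free subset, ties broken by total score."""
    for size in range(len(clips), 0, -1):
        consistent = [
            subset for subset in combinations(clips, size)
            if not any(conflicts(a, b) for a, b in combinations(subset, 2))
        ]
        if consistent:
            return list(max(consistent, key=lambda s: sum(c.score for c in s)))
    return []


def aggregate(clips):
    """Stage 3: order evidence by start time, citing clip IDs and time ranges."""
    ordered = sorted(clips, key=lambda c: c.start)
    return [f"[{c.clip_id} {c.start:.0f}-{c.end:.0f}s] {c.attributes}" for c in ordered]


def sprr(clips, k_prime=4):
    return aggregate(resolve(validate(clips, k_prime)))


# Hypothetical candidates: c3 contradicts c1 and c5 on the jacket colour,
# and c4 fails every sub-problem.
candidates = [
    Clip("c1", 120.0, 135.0, [1, 1, 1, 0], {"jacket": "red"}),
    Clip("c2", 10.0, 25.0, [1, 1, 1, 1], {"location": "kitchen"}),
    Clip("c3", 300.0, 320.0, [1, 0, 0, 0], {"jacket": "blue"}),
    Clip("c4", 60.0, 80.0, [0, 0, 0, 0], {"location": "garage"}),
    Clip("c5", 200.0, 210.0, [1, 1, 0, 0], {"jacket": "red"}),
]

chain = sprr(candidates, k_prime=4)
for line in chain:
    print(line)
```

In this toy run, validation drops the lowest-scoring clip, conflict resolution removes the clip whose attribute contradicts two higher-scoring clips, and the remaining evidence is emitted in temporal order with its clip ID and time range.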