Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

arXiv:2602.10159v1 Announce Type: new
Abstract: Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human “Recall-Search-Verify” cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
