Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response
arXiv:2602.15237v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies like spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically-grounded distances of detected objects (e.g., the chair […]