GLANCE: Gaze-Led Attention Network for Compressed Edge-inference

arXiv:2603.15717v1 Announce Type: new
Abstract: Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10 ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest (ROI) object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate (MAC) computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10 ms latency, meeting stringent AR/VR requirements by improving the communication time by $177\times$. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, medium, and large objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
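
The second stage described in the abstract, running detection only on the region the user is looking at, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gaze_roi` and `crop`, the fixed ROI size, and the clamping policy are all assumptions chosen for illustration, and the abstract does not say how the attended region is actually sized or placed.

```python
def gaze_roi(gaze_x, gaze_y, img_w, img_h, roi_w, roi_h):
    """Return (x0, y0, x1, y1) for a roi_w x roi_h window centered on the
    gaze point, shifted where needed so it stays inside the image.

    Hypothetical helper: the fixed-size window is an assumption, not
    something stated in the abstract.
    """
    if roi_w > img_w or roi_h > img_h:
        raise ValueError("ROI must fit inside the image")
    # Top-left corner of a window centered on the gaze point.
    x0 = int(round(gaze_x - roi_w / 2))
    y0 = int(round(gaze_y - roi_h / 2))
    # Clamp so the full window remains within the image bounds.
    x0 = max(0, min(x0, img_w - roi_w))
    y0 = max(0, min(y0, img_h - roi_h))
    return x0, y0, x0 + roi_w, y0 + roi_h


def crop(image, box):
    """Crop a row-major 2D list of pixels to box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Illustrative values only: a 640x480 frame with a 320x240 attended window.
box = gaze_roi(320, 240, 640, 480, 320, 240)
```

A detector would then run on `crop(frame, box)` instead of the full frame, which is where the saving in computation would come from under these assumptions.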
