Beyond ReconVLA: Annotation-Free Visual Grounding via Language-Attention Masked Reconstruction
Replacing gaze annotations with language-driven attention masking makes robot perception annotation-free and up to 5x faster at inference.

Picture a robot arm sitting across a table from you. You say: “Put the black bowl in the drawer.” The arm moves. But not toward the bowl. It hovers. It hesitates. Then it grabs the wrong thing. From the outside, this looks like a minor coordination failure. From the inside, it is a fundamental problem with how the robot perceives […]