Vocabulary Restriction of VLAs (Vision-Language-Action models)

Hello,

I wanted to ask how you restrict the output vocabulary / the set of possible actions of VLAs. Specifically, I am currently reading the RT-2 and OpenVLA papers. OpenVLA references RT-2, and RT-2 says nothing specific; it just says about the fine-tuning phase:

“Thus, to ensure that RT-2 outputs valid action tokens during decoding, we constrain its output vocabulary via only sampling valid action tokens when the model is prompted with a robot-action task …”

So do you just crop or clamp the logits, or is there another variant?
I would also really appreciate it if you could recommend some papers, blog posts, or other resources where I can learn about VLAs in detail.
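For context, my current guess is that it is just a logit mask applied before sampling: set every non-action token's logit to negative infinity so it gets zero probability. A minimal sketch of that idea (the vocabulary size and action-token ID range below are made-up placeholders, not the actual RT-2 values):

```python
import numpy as np

# Hypothetical numbers for illustration only: a 32k vocabulary where the last
# 256 token IDs have been repurposed as discretized action-bin tokens.
VOCAB_SIZE = 32000
ACTION_TOKEN_IDS = np.arange(VOCAB_SIZE - 256, VOCAB_SIZE)

def constrain_logits(logits: np.ndarray, allowed_ids: np.ndarray) -> np.ndarray:
    """Return a copy of `logits` with every disallowed token masked to -inf."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

def sample_action_token(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Sample one token, but only from the allowed action-token set."""
    masked = constrain_logits(logits, ACTION_TOKEN_IDS)
    # Softmax over the masked logits; exp(-inf) = 0, so disallowed tokens
    # end up with exactly zero probability.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB_SIZE, p=probs))

rng = np.random.default_rng(0)
logits = rng.normal(size=VOCAB_SIZE)
token = sample_action_token(logits, rng)
assert token in ACTION_TOKEN_IDS  # the sample can never leave the allowed set
```

Is this roughly what "constrain its output vocabulary via only sampling valid action tokens" means in practice, or do they do something more involved?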

submitted by /u/Papabaer06
