Vocabulary Restriction of VLAs (Vision Language Action)
Hello,
I wanted to ask how you restrict the output vocabulary / possible actions of VLAs. Specifically, I am currently reading the RT-2 and OpenVLA papers. OpenVLA references RT-2, and RT-2 says nothing specific; it only says this about the fine-tuning phase:
“Thus, to ensure that RT-2 outputs valid action tokens during decoding, we constrain its output vocabulary via only sampling valid action tokens when the model is prompted with a robot-action task …”
So do you just crop or clamp the output distribution? Or is there another variant?
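My current guess (not confirmed by the paper) is that "only sampling valid action tokens" means masking the logits before sampling: every token outside the action set gets a score of negative infinity, so softmax assigns it zero probability. A minimal sketch with made-up toy numbers:

```python
import numpy as np

def mask_to_action_tokens(logits, action_token_ids):
    """Set logits of all non-action tokens to -inf so that
    argmax/softmax can only ever select a valid action token."""
    masked = np.full_like(logits, -np.inf)
    masked[action_token_ids] = logits[action_token_ids]
    return masked

# Toy vocabulary of 10 tokens; pretend ids 3..8 are the action bins
# (RT-2 uses 256 discretized bins per action dimension in the real model).
logits = np.array([5.0, 1.0, 9.0, 0.5, 2.0, -1.0, 3.0, 0.0, 1.5, 4.0])
action_ids = np.arange(3, 9)

masked = mask_to_action_tokens(logits, action_ids)
print(int(np.argmax(masked)))  # -> 6: best *action* token, even though token 2 scores higher overall
```

In a Hugging Face-style pipeline the same idea would live in a custom `LogitsProcessor` applied at each decoding step, but I'm not sure whether RT-2/OpenVLA do exactly this or something else.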
Also, I would really appreciate it if you could recommend some papers, blog posts, or other resources where I can learn about VLAs in detail.
submitted by /u/Papabaer06