Training VLM agents is broken and nobody talks about why
Been going deep on multi-turn VLM agent training lately and keep running into the same fundamental problem that I think the field is underreacting to: credit assignment across long trajectories is genuinely unsolved, and most people are patching around it rather than fixing it.
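The core issue is simple to describe and brutal to solve. Your agent takes 20 actions, gets a reward signal at the end, and you need to figure out which 3 actions actually mattered. Standard GRPO compares rollouts at the trajectory level: each rollout's final reward is normalized against the others in its group, and the resulting advantage is applied uniformly to every action in that rollout. That works fine for short single-turn tasks. Stretch it out to multi-step visual reasoning or tool-use chains and the per-action signal becomes too diffuse to be useful, because the 3 decisive actions get exactly the same credit as the 17 that didn't matter.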
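To make the problem concrete, here is a minimal sketch of trajectory-level, group-relative advantages as I understand the outcome-reward GRPO setup. The function name, the toy rewards, and the use of population standard deviation are my own assumptions for illustration, not taken from any particular library.

```python
from statistics import mean, pstdev

def trajectory_level_advantages(group_rewards, steps_per_rollout, eps=1e-8):
    """Illustrative only: one scalar advantage per rollout, copied to every step.

    group_rewards: final reward of each rollout in the sampled group.
    steps_per_rollout: number of actions taken in each rollout.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # assumption: population std, plus eps for stability
    per_rollout = [(r - mu) / (sigma + eps) for r in group_rewards]
    # Every step inherits its rollout's scalar; nothing distinguishes
    # a decisive action from an irrelevant one.
    return [[adv] * n for adv, n in zip(per_rollout, steps_per_rollout)]

# Toy group: 4 rollouts of 20 actions each, binary end-of-episode reward.
rewards = [1.0, 0.0, 0.0, 1.0]
steps = [20, 20, 20, 20]
advs = trajectory_level_advantages(rewards, steps)
```

In this toy group, all 20 actions of a successful rollout receive the same positive advantage and all 20 actions of a failed one receive the same negative advantage. Any finer-grained credit has to come from somewhere else.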
What’s interesting is that recent approaches like GROW are attacking this at the structural level rather than the model level. The insight is that how you construct and sample from trajectories during training matters more than which base model you start from. Trajectory architecture, essentially, is the lever.
This flips the usual conversation. Everyone obsesses over model scale and benchmark scores, but if your training loop can’t assign credit cleanly across steps, you’re leaving enormous performance on the table regardless of how big your model is.
Curious whether others have hit this wall practically. Are you solving it through reward shaping, trajectory segmentation, something else entirely? And does anyone think trajectory-level GRPO is salvageable for genuinely long-horizon tasks, or is structural reform the only real path forward?
submitted by /u/obliq_news