LingBot-VLA vs π0.5 vs GR00T N1.6 vs WALL-OSS: 22,500 real-world trials across 3 platforms and 100 tasks

We just finished what I think is one of the larger controlled VLA comparisons on physical robots and wanted to share the results with this community, since the scaling and policy learning findings feel very relevant to RL.

The setup: 3 dual-arm platforms (Agibot G1, AgileX, Galaxea R1Pro), 100 manipulation tasks per platform from the GM-100 benchmark, 130 post-training trajectories per task, and 15 evaluation trials per task per model. All four models were fine-tuned from their public checkpoints using the exact same data, hyperparameters (batch size 256, 20 epochs), and hardware. Evaluation for each task was run sequentially on the same physical robot unit to eliminate hardware variance. Full results are in the paper (arXiv:2601.18692).
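
For concreteness, here's a minimal sketch of how the per-model numbers aggregate under this protocol. The structure and names (TrialResult, and treating the progress score as a fraction of completed subgoals) are my own assumptions, not the paper's evaluation code:

```python
# Hypothetical sketch of the evaluation aggregation; names and the exact
# definition of "progress" are my assumptions, not the released codebase.
from dataclasses import dataclass

PLATFORMS = ["agibot_g1", "agilex", "galaxea_r1pro"]
TASKS_PER_PLATFORM = 100
TRIALS_PER_TASK = 15

@dataclass
class TrialResult:
    platform: str
    task_id: int
    success: bool        # binary task completion
    progress: float      # fraction of subgoals completed, in [0, 1] (my assumption)

def aggregate(results: list[TrialResult]) -> tuple[float, float]:
    """Average success rate and progress score over all trials."""
    n = len(results)
    sr = sum(r.success for r in results) / n
    ps = sum(r.progress for r in results) / n
    return sr, ps

# 3 platforms x 100 tasks x 15 trials = 4,500 trials per evaluated model
# variant; 5 variants in the table below gives the 22,500 total.
assert len(PLATFORMS) * TASKS_PER_PLATFORM * TRIALS_PER_TASK == 4500
```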

Here are the averaged results across all 3 embodiments:

| Model | Success Rate | Progress Score |
|---|---|---|
| WALL-OSS | 4.05% | 10.35% |
| GR00T N1.6 | 7.59% | 15.99% |
| π0.5 | 13.02% | 27.65% |
| LingBot-VLA (no depth) | 15.74% | 33.69% |
| LingBot-VLA (w/ depth) | 17.30% | 35.41% |

The depth integration uses a query-based distillation approach where learnable queries for each camera view are processed through the VLM backbone and aligned with depth embeddings via cross-attention projection. This adds spatial grounding without significantly changing inference cost. In simulation (RoboTwin 2.0, 50 tasks), the with-depth vs. no-depth gap is even clearer: 88.56% vs 82.74% SR in clean scenes, 86.68% vs 76.76% in randomized scenes.
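
Here's a rough PyTorch sketch of how I read the query-based depth distillation. The module layout, dimensions, and the direction of the alignment loss are my guesses at the idea, not the released implementation:

```python
# Sketch: learnable per-view queries go through the VLM backbone (outside this
# module); their output features are then aligned with depth embeddings via a
# cross-attention projection. All names/dims here are my assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthQueryDistill(nn.Module):
    def __init__(self, n_views: int = 3, n_queries: int = 16,
                 d_model: int = 1024, d_depth: int = 768):
        super().__init__()
        # one block of learnable queries per camera view
        self.queries = nn.Parameter(torch.randn(n_views, n_queries, d_model) * 0.02)
        self.depth_proj = nn.Linear(d_depth, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def view_queries(self, batch_size: int) -> torch.Tensor:
        # queries to prepend to the VLM input, one block per camera view
        q = self.queries.flatten(0, 1)                     # (n_views*n_queries, d_model)
        return q.unsqueeze(0).expand(batch_size, -1, -1)

    def alignment_loss(self, vlm_query_feats, depth_embeds):
        # vlm_query_feats: (B, n_views*n_queries, d_model), queries after the VLM pass
        # depth_embeds:    (B, n_depth_tokens, d_depth) from a depth encoder
        kv = self.depth_proj(depth_embeds)
        target, _ = self.cross_attn(vlm_query_feats, kv, kv)
        # pull the query features toward the depth-grounded projection
        return F.mse_loss(vlm_query_feats, target.detach())
```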

What I find most interesting from an RL perspective is the scaling behavior. LingBot-VLA uses flow matching as the action generation policy (conditional flow matching on action chunks of length 50), and the architecture is a Mixture-of-Transformers where the VLM and action expert share self-attention but have separate feedforward pathways. We scaled pretraining data from 3,000 to 20,000 hours of real-world teleoperation across 9 robot configs and tracked downstream success rates. The curve shows no saturation at 20K hours, which is a pretty strong signal that these flow-matching VLA policies have favorable scaling properties with respect to real-world data volume. This is the first systematic study I’m aware of that demonstrates this on physical robots rather than in simulation.
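
For reference, the conditional flow-matching objective on action chunks generally looks like the sketch below. This is the standard recipe rather than the repo's training code; the action dimension and the action_expert signature are placeholders:

```python
# Generic conditional flow-matching loss on action chunks: the action expert
# predicts the velocity that transports Gaussian noise to the ground-truth chunk.
import torch
import torch.nn.functional as F

CHUNK_LEN, ACTION_DIM = 50, 14   # 50-step chunks; ACTION_DIM is my placeholder

def flow_matching_loss(action_expert, obs_embed, actions):
    """
    obs_embed: (B, d_ctx) fused vision-language context from the VLM
    actions:   (B, CHUNK_LEN, ACTION_DIM) ground-truth action chunk
    """
    B = actions.shape[0]
    noise = torch.randn_like(actions)                  # x_0 ~ N(0, I)
    t = torch.rand(B, 1, 1, device=actions.device)     # flow time in (0, 1)
    x_t = (1 - t) * noise + t * actions                # linear interpolation path
    target_velocity = actions - noise                  # d x_t / d t along the path
    pred_velocity = action_expert(x_t, t.squeeze(-1).squeeze(-1), obs_embed)
    return F.mse_loss(pred_velocity, target_velocity)
```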

On the engineering side, the training codebase hits 261 samples/sec/GPU on an 8-GPU setup using FSDP2 with a hybrid sharding strategy for the action expert modules, FlexAttention for the sparse multimodal fusion, and torch.compile for operator fusion. That’s 1.5x to 2.8x faster than OpenPI, StarVLA, and Dexbotic depending on the VLM backbone, and it scales near-linearly out to 256 GPUs.
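
A sketch of what the FSDP2 hybrid-sharding plus torch.compile setup might look like, as my approximation of the described recipe rather than the repo's code; model.action_expert.blocks is a hypothetical attribute:

```python
# Assumed shape of the training-side parallelism: 2D device mesh (replicate
# across nodes, shard within a node), per-block sharding of the action expert,
# then torch.compile for operator fusion. Not the released implementation.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 API (PyTorch >= 2.6)

def shard_and_compile(model, num_nodes: int, gpus_per_node: int):
    # hybrid sharding: replicate across nodes, shard parameters within a node
    mesh = init_device_mesh(
        "cuda", (num_nodes, gpus_per_node), mesh_dim_names=("replicate", "shard")
    )
    # shard the action-expert blocks individually so all-gathers overlap compute
    for block in model.action_expert.blocks:  # hypothetical module path
        fully_shard(block, mesh=mesh)
    fully_shard(model, mesh=mesh)
    # operator fusion over the forward/backward graph
    return torch.compile(model)
```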

One thing worth noting: the absolute success rates are still quite low even for the best model (17.3% average across 100 tasks). The GM-100 benchmark is deliberately challenging with many fine-grained multi-step tasks, and ~50% of the atomic actions in the test set don’t appear in the top 100 training actions. So this is really testing generalization, not memorization. But it also highlights how far we are from reliable real-world manipulation policies.

Data efficiency is another interesting angle: with only 80 demonstrations per task, LingBot-VLA already outperforms π0.5 trained on the full 130 demonstrations, and the gap widens as you add more post-training data. This suggests the large-scale pretraining is doing meaningful work as a policy prior.

Everything is open-sourced:

Code: https://github.com/robbyant/lingbot-vla

Models: https://huggingface.co/collections/robbyant/lingbot-vla

Paper: https://arxiv.org/abs/2601.18692

Benchmark data is also released.

Curious what people think about flow matching vs diffusion vs autoregressive approaches for action generation in this regime. The no-saturation scaling result also raises the question of whether we’re just seeing the easy part of the curve or if there’s something fundamentally different about how these models scale compared to, say, offline RL approaches that tend to plateau much earlier.

submitted by /u/Ill_Awareness6706
