PI-VLA: A Symmetry-Aware Predictive and Interactive Vision–Language–Action Framework for Robust Robotic Manipulation

Vision–language–action (VLA) models often lack robustness in long-horizon manipulation tasks because they neither explicitly exploit structural symmetries nor react adaptively when those symmetries are violated by environmental uncertainty. To address this limitation, this paper proposes PI-VLA, a symmetry-aware predictive and interactive VLA framework for robust robotic manipulation. PI-VLA is built on three symmetry-driven principles. First, a Cognitive–Motor Synergy (CMS) module jointly generates discrete and continuous action chunks, together with predictive world-model features, in a single forward pass, enforcing cross-modal action consistency as an implicit symmetry constraint across heterogeneous action representations. Second, a unified training objective integrates imitation learning, reinforcement learning, and state prediction, encouraging invariance to task-relevant transformations while allowing adaptive symmetry breaking when long-horizon deviations emerge. Third, an Active Uncertainty-Resolving Decider (AURD) monitors action-consensus discrepancies and state-prediction errors as symmetry-breaking signals, dynamically adjusting the execution horizon through closed-loop replanning. Extensive experiments show that PI-VLA achieves state-of-the-art performance, attaining a 73.2% average success rate on the LIBERO benchmark and an 88.3% success rate in real-world manipulation tasks under visual distractions and unseen conditions. Ablation studies confirm that symmetry-aware action consensus and uncertainty-triggered replanning are critical to robust execution. These results establish PI-VLA as a principled framework that leverages symmetry preservation and controlled symmetry breaking for reliable and interactive robotic manipulation.
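
One way to read the unified training objective described in the abstract is as a weighted composite loss. The following formulation is only a schematic sketch under assumptions: the abstract does not specify the actual loss terms or weights, so the symbols \lambda_{\text{RL}}, \lambda_{\text{pred}}, the chunk horizon H, and the squared-error state-prediction term are hypothetical placeholders for whatever imitation, reinforcement, and prediction terms PI-VLA actually combines.

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{IL}}\bigl(\hat{a}_{t:t+H}, a^{*}_{t:t+H}\bigr) + \lambda_{\text{RL}}\,\mathcal{L}_{\text{RL}}(\pi_{\theta}) + \lambda_{\text{pred}}\,\bigl\lVert \hat{s}_{t+1:t+H} - s_{t+1:t+H} \bigr\rVert_{2}^{2}

Under this reading, the imitation term anchors the policy to demonstrated action chunks, the reinforcement term rewards task-relevant deviations, and the prediction term ties the world-model features to observed future states.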
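The AURD mechanism can likewise be illustrated as a simple thresholded decision over the two symmetry-breaking signals named in the abstract: the discrepancy between the discrete and continuous action heads, and the world-model state-prediction error. The sketch below is an assumed minimal reading, not the paper's implementation; the function name, thresholds, horizons, and distance metrics are all hypothetical.

import numpy as np

def aurd_replan_decision(discrete_actions, continuous_actions,
                         predicted_states, observed_states,
                         consensus_threshold=0.15, prediction_threshold=0.25,
                         default_horizon=8, safe_horizon=2):
    """Illustrative AURD-style decision rule (all thresholds and horizons are assumed).

    Monitors two symmetry-breaking signals:
      1. disagreement between the discrete and continuous action chunks, and
      2. error between world-model state predictions and observed states.
    If either signal exceeds its threshold, the execution horizon is shortened
    and a closed-loop replan is requested; otherwise the full chunk is executed.
    """
    # Action-consensus discrepancy: mean distance between the two action representations,
    # assuming the discrete head has been decoded into the continuous action space.
    consensus_gap = float(np.mean(
        np.linalg.norm(discrete_actions - continuous_actions, axis=-1)))

    # State-prediction error: how far the predictive features drifted from observation.
    prediction_error = float(np.mean(
        np.linalg.norm(predicted_states - observed_states, axis=-1)))

    replan = consensus_gap > consensus_threshold or prediction_error > prediction_threshold
    return {
        "replan": replan,
        "horizon": safe_horizon if replan else default_horizon,
        "consensus_gap": consensus_gap,
        "prediction_error": prediction_error,
    }

In this toy form, the controller would execute only the first safe_horizon steps of the current action chunk whenever replan is True, then query the policy again, which mirrors the closed-loop horizon adjustment the abstract attributes to AURD.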
