Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 88.7% to 96.2%, and in the $ABC \to D$ setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. https://github.com/liufanfanlff/RoboUniview

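Research on using Vision-Language Models (VLMs) for robotic manipulation introduces a new paradigm that aims to improve the model's ability to generalize to new objects and instructions. To address the performance disparities caused by variations in camera specifications and mounting positions, the work proposes RoboUniView, a method that learns a unified view representation from multiple perspectives and derives the actions for robotic manipulation from that representation. This unified view representation more accurately reflects the physical world and is not constrained by the robotic platform's camera parameters. It achieves state-of-the-art performance on the CALVIN benchmark, raising the success rate in the $D \to D$ setting from 88.7% to 96.2%. In addition, the model shows strong adaptability and flexibility: it maintains high performance under unseen camera parameters, can use multiple datasets with different camera parameters, and is capable of joint cross-task learning across datasets. The code has been released for re-implementation.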

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation