Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.
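The cross-view attention described above can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption for illustration: generic single-head scaled dot-product attention, a residual connection, random projection weights, flattened 4x4 feature maps, and concatenation as the fusion step. The function name `cross_view_attention` is hypothetical and NumPy stands in for a deep learning framework.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Attend from one view (queries) to the other view (keys/values).

    query_feats:   (N_q, D) spatial features of the attending view
    context_feats: (N_c, D) spatial features of the attended view
    Returns the updated query features (N_q, D) and the attention map (N_q, N_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption): keep the view's own features.
    return query_feats + attn @ v, attn

rng = np.random.default_rng(0)
D = 8  # feature dimension (illustrative)

# Stand-ins for CNN feature maps of each camera, flattened to 16 locations.
third_person = rng.normal(size=(16, D))
egocentric = rng.normal(size=(16, D))

def make_weights():
    # Random projections; in practice these would be learned.
    return [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]

# Attention in both directions: egocentric -> third-person and vice versa.
ego_attended, attn_e2t = cross_view_attention(egocentric, third_person, *make_weights())
tp_attended, attn_t2e = cross_view_attention(third_person, egocentric, *make_weights())

# Fuse the two attended views into one feature tensor for the RL policy.
fused = np.concatenate([ego_attended, tp_attended], axis=-1)
```

Each row of `attn_e2t` is a distribution over third-person locations for one egocentric location, which is one way to read "spatial attention from one view to another"; `fused` is what a downstream policy network would consume in this sketch.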

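This work proposes an approach to precision manipulation from visual feedback. It combines visual feedback from a third-person camera and an egocentric camera mounted on the robot's wrist, fuses information from the two views using Transformers with a cross-view attention mechanism, and feeds the learned features to a reinforcement learning policy. Experiments show that the method improves learning over single-view and multi-view baselines and transfers successfully to real-robot manipulation tasks with uncalibrated cameras, no access to state information, and a high degree of task variability.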

An in-depth study of bridging egocentric and third-person views with Transformers for robotic manipulation