In recent years, the transformer architecture has become the de facto standard for machine learning algorithms applied to natural language processing and computer vision. Despite notable evidence of successful deployment of this architecture in the context of robot learning, we claim that vanilla transformers do not fully exploit the structure of the robot learning problem. Therefore, we propose Body Transformer (BoT), an architecture that leverages the robot embodiment by providing an inductive bias that guides the learning process. We represent the robot body as a graph of sensors and actuators, and rely on masked attention to pool information throughout the architecture. The resulting architecture outperforms the vanilla transformer, as well as the classical multilayer perceptron, in terms of task completion, scaling properties, and computational efficiency when representing either imitation or reinforcement learning policies. Additional material including the open-source code is available at https://sferrazza.cc/bot_site.
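
To make the masked-attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hypothetical four-node kinematic chain (torso, hip, knee, ankle), a single attention head with no learned projections, and a mask that lets each node attend only to itself and its direct neighbors in the body graph. The function names `body_attention_mask` and `masked_self_attention` are illustrative and do not come from the paper or its code release.

```python
import numpy as np

def body_attention_mask(num_nodes, edges):
    """Boolean mask: entry (i, j) is True if node i may attend to node j.

    Assumption: each node sees itself and its direct graph neighbors.
    """
    mask = np.eye(num_nodes, dtype=bool)
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True  # treat the body graph as undirected
    return mask

def masked_self_attention(x, mask):
    """Single-head self-attention without learned projections (illustrative)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # Disallowed pairs get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Hypothetical chain: torso(0) - hip(1) - knee(2) - ankle(3)
edges = [(0, 1), (1, 2), (2, 3)]
mask = body_attention_mask(4, edges)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # one 8-dimensional token per body node
out, weights = masked_self_attention(x, mask)
```

In this sketch a single masked layer only mixes information between adjacent body parts; stacking several such layers lets information travel further along the graph, which is one way to read "pool information throughout the architecture".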

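This paper examines the shortcomings of vanilla transformers in robot learning and proposes the Body Transformer (BoT) architecture, which represents the robot body as a graph of sensors and actuators and uses masked attention to guide the learning process. The results show that BoT outperforms both the vanilla transformer and the multilayer perceptron in task completion, scaling properties, and computational efficiency, indicating significant potential for practical application.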

Body Transformer: Leveraging Robot Embodiment for Policy Learning