We propose a method for sim-to-real robot learning that exploits simulator state information in a way that scales to many objects. First, we train a pair of encoders on raw object pose targets to learn representations that accurately capture the state information of a multi-object environment. Second, we use these encoders in a reinforcement learning algorithm to train image-based policies capable of manipulating many objects. The pair consists of one encoder that consumes RGB images and is used in our policy network, and one that directly consumes a set of raw object poses and is used for reward calculation and value estimation. We evaluate our method on the task of pushing a collection of objects to desired tabletop regions. Compared to methods that rely only on images or use fixed-length state encodings, our method achieves higher success rates, performs well in the real world without fine-tuning, and generalizes to numbers and types of objects not seen during training.

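The paper proposes a method for multi-object robot learning that exploits simulator state information. A pair of encoder networks is trained to capture multi-object state information in a latent space: one encoder is a convolutional neural network, and the other is a graph neural network state encoder. This allows the system to operate on real-world RGB images and to train reinforcement learning policies for multi-object manipulation effectively. The method achieves higher success rates than conventional approaches that rely on images or fixed-length state encodings, performs well in the real world without fine-tuning, and generalizes to numbers and types of objects not seen during training.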
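
To illustrate why a set-based state encoder scales to different numbers of objects, the following is a minimal sketch of a permutation-invariant encoder over object poses. It is not the architecture used in this work: mean pooling over per-object features stands in for the graph neural network, the weights are random and untrained, and the names `make_layer`, `dense_tanh`, `encode_pose_set`, the 3-D planar pose `(x, y, yaw)`, and the layer sizes are all assumptions made for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create random weights for one dense layer (illustrative, untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense_tanh(layer, x):
    """Apply a dense layer followed by a tanh nonlinearity."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def encode_pose_set(poses, phi, rho):
    """Map a variable-size set of poses to a fixed-length latent vector."""
    if not poses:
        raise ValueError("expected at least one object pose")
    # Shared per-object network: the same weights are applied to every pose.
    per_object = [dense_tanh(phi, pose) for pose in poses]
    # Mean pooling removes any dependence on object order or object count.
    pooled = [sum(column) / len(per_object) for column in zip(*per_object)]
    return dense_tanh(rho, pooled)


rng = random.Random(0)
POSE_DIM = 3   # assumed planar pose: (x, y, yaw)
phi = make_layer(POSE_DIM, 16, rng)  # per-object feature network
rho = make_layer(16, 8, rng)         # post-pooling network, 8-D latent

three_objects = [(0.1, 0.2, 0.0), (0.4, -0.3, 1.2), (-0.2, 0.5, 0.7)]
five_objects = three_objects + [(0.0, 0.0, 0.3), (0.3, 0.3, -0.5)]

z3 = encode_pose_set(three_objects, phi, rho)
z5 = encode_pose_set(five_objects, phi, rho)
z3_shuffled = encode_pose_set(list(reversed(three_objects)), phi, rho)

print(len(z3), len(z5))  # 8 8
```

Because the pooled representation has the same length for three or five objects and does not depend on their order, a single network can be reused across scenes with different object counts, which a fixed-length concatenation of poses cannot do without padding or retraining.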

Manipulating collections of objects using physically grounded state representation learning