To improve the sample efficiency of vision-based deep reinforcement learning (RL), we propose a novel method, called SPIRL, that automatically extracts important patches from input images. Following Masked Auto-Encoders, SPIRL is based on Vision Transformer models pre-trained in a self-supervised fashion to reconstruct images from randomly sampled patches. These pre-trained models can then be exploited to detect and select salient patches, defined as patches that are hard to reconstruct from their neighboring patches. In RL, the SPIRL agent processes the selected salient patches via an attention module. We empirically validate SPIRL on Atari games, testing its data efficiency against relevant state-of-the-art methods, including some traditional model-based methods and keypoint-based models. In addition, we analyze our model's interpretability.
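
The selection idea described above, keeping the patches that are hardest to reconstruct, can be illustrated with a minimal sketch. It is not the SPIRL implementation: it assumes a reconstructed image has already been produced by some pre-trained model (not shown here), scores each non-overlapping patch by mean squared reconstruction error, and keeps the top-k. The function names, the error measure, and the fixed `k` are illustrative assumptions.

```python
import numpy as np


def patch_errors(image, reconstruction, patch):
    """Mean squared reconstruction error of each non-overlapping patch.

    Assumption: `reconstruction` comes from a pre-trained reconstruction
    model that is not part of this sketch.
    """
    h, w = image.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    sq = (image.astype(np.float64) - reconstruction.astype(np.float64)) ** 2
    if sq.ndim == 3:  # average over colour channels if present
        sq = sq.mean(axis=2)
    gh, gw = h // patch, w // patch
    # Split rows and columns into (grid index, offset inside patch).
    return sq.reshape(gh, patch, gw, patch).mean(axis=(1, 3))


def select_salient_patches(image, reconstruction, patch=8, k=4):
    """Return grid coordinates (row, col) of the k worst-reconstructed patches.

    Illustrative choice: top-k by error; other selection rules are possible.
    """
    errors = patch_errors(image, reconstruction, patch)
    order = np.argsort(errors.ravel())[::-1][:k]  # largest error first
    rows, cols = np.unravel_index(order, errors.shape)
    return [(int(r), int(c)) for r, c in zip(rows, cols)], errors
```

For instance, on a 16x16 frame whose only poorly reconstructed content lies in the lower-left 8x8 block, `select_salient_patches(frame, recon, patch=8, k=1)` returns that block's grid coordinate, which a downstream agent could then attend to.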

Unsupervised Salient Patch Selection for Data-Efficient Reinforcement Learning