Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, optimization becomes increasingly difficult, leading to low sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning. Specifically, SODA imposes a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data, while the RL optimization process uses strictly non-augmented data. Empirical evaluations are performed on diverse tasks from the DeepMind Control Suite as well as a robotic manipulation task, and we find that SODA significantly improves sample efficiency, generalization, and training stability over state-of-the-art vision-based RL methods.
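
The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear `encode`, the Gaussian-noise `augment`, and the use of a squared distance between normalized latents as a stand-in for the mutual-information objective are all assumptions made for illustration, as is every function name. The point it shows is only the structure: the RL loss sees the non-augmented observation, while the auxiliary loss ties the augmented latent to the non-augmented one.

```python
import math
import random


def encode(weights, obs):
    """Toy linear encoder (hypothetical stand-in for a learned encoder)."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]


def augment(obs, rng, scale=0.1):
    """Hypothetical augmentation: additive Gaussian noise on each input."""
    return [x + rng.gauss(0.0, scale) for x in obs]


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def consistency_loss(z_aug, z_clean):
    """Squared distance between normalized latents.

    Assumed proxy for the soft constraint; zero when the latents align.
    """
    a, b = l2_normalize(z_aug), l2_normalize(z_clean)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soda_style_losses(weights, obs, rl_loss_fn, rng):
    """Return (rl_loss, aux_loss) for one observation."""
    z_clean = encode(weights, obs)
    # RL objective: computed strictly on the non-augmented observation.
    rl_loss = rl_loss_fn(z_clean)
    # Auxiliary objective: the only place augmented data is used.
    z_aug = encode(weights, augment(obs, rng))
    aux_loss = consistency_loss(z_aug, z_clean)
    return rl_loss, aux_loss


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]
    rl_loss, aux_loss = soda_style_losses(
        weights, [0.5, -0.2], lambda z: sum(v * v for v in z), random.Random(0)
    )
    print(rl_loss, aux_loss)
```

In this sketch the RL loss is unaffected by the augmentation's randomness, so stronger augmentations change only the auxiliary term. A training loop would minimize both terms, with the auxiliary term updating the encoder alone.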

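This paper proposes SOft Data Augmentation (SODA), which imposes a constraint on the encoder to maximize the mutual information between latent representations of augmented and non-augmented data, thereby improving the sample efficiency, generalization, and training stability of reinforcement learning. Experiments show that the method significantly outperforms state-of-the-art vision-based RL methods.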

Generalization in Reinforcement Learning by Soft Data Augmentation