This paper proposes an algorithm that aims to improve the generalization of reinforcement learning agents by removing overfitting to confounding features. Our approach is built on a max-min game-theoretic objective. A generator transfers the style of observations during reinforcement learning and, in addition, perturbs each observation so as to maximize the probability that the agent takes a different action. In contrast, a policy network updates its parameters to minimize the effect of such perturbations, thus staying robust while maximizing the expected future reward. Based on this setup, we propose a practical deep reinforcement learning algorithm, Adversarial Robust Policy Optimization (ARPO), to find a robust policy that generalizes to unseen environments. We evaluate our approach on Procgen and the Distracting Control Suite for generalization and sample efficiency. Empirically, ARPO shows improved performance compared to a few baseline algorithms, including data augmentation.
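
The two sides of the max-min game described above can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the ARPO implementation: the linear-softmax `policy`, the scale-and-shift `generator` standing in for a learned style-transfer network, the use of KL divergence to measure the change in action probabilities, and the weight `beta` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy(theta, obs):
    # Hypothetical linear-softmax policy: one action distribution per observation.
    return softmax(obs @ theta)

def generator(phi, obs):
    # Hypothetical stand-in for the style-transfer generator:
    # a per-feature scale and shift instead of a learned network.
    scale, shift = phi
    return obs * scale + shift

def kl(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch (assumed divergence measure).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def action_shift(theta, phi, obs):
    # How far the action distribution moves when the observation is perturbed.
    return kl(policy(theta, obs), policy(theta, generator(phi, obs)))

def generator_objective(theta, phi, obs):
    # Generator side (max): make the agent's action distribution change.
    return action_shift(theta, phi, obs)

def policy_loss(theta, phi, obs, actions, advantages, beta=1.0):
    # Policy side (min): an advantage-weighted log-likelihood surrogate,
    # written as a loss, plus a penalty on the shift the generator induces.
    probs = policy(theta, obs)
    logp = np.log(probs[np.arange(len(actions)), actions] + 1e-12)
    return float(-np.mean(logp * advantages)) + beta * action_shift(theta, phi, obs)
```

In this sketch the generator parameters `phi` would be updated to increase `generator_objective`, while the policy parameters `theta` would be updated to decrease `policy_loss`. An identity generator (scale 1, shift 0) leaves the action distribution unchanged, so the penalty term vanishes and only the reward surrogate remains.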

Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning