A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domain experienced during training. In this paper, we approach this challenge through the following invariance principle: an agent must find a representation such that there exists an action-predictor built on top of this representation that is simultaneously optimal across all training domains. Intuitively, the resulting invariant policy enhances generalization by finding causes of successful actions. We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that explicitly enforces this principle and learns an invariant policy during training. We compare our approach with standard policy gradient methods and demonstrate significant improvements in generalization performance on unseen domains for Linear Quadratic Regulator (LQR) problems and our own benchmark in the MiniGrid Gym environment.
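One way to make the invariance principle above concrete is the bilevel form below, written in the style of invariant risk minimization. It is a sketch under assumed notation and not necessarily the exact formulation used in the paper: $\mathcal{E}_{\mathrm{tr}}$ denotes the set of training domains, $\Phi$ the representation, $w$ the action-predictor built on top of it, and $R^{e}$ a per-domain cost (for example, the negative expected return in domain $e$).

```latex
\min_{\Phi,\, w} \; \sum_{e \in \mathcal{E}_{\mathrm{tr}}} R^{e}(w \circ \Phi)
\quad \text{subject to} \quad
w \in \arg\min_{\bar{w}} R^{e}(\bar{w} \circ \Phi)
\quad \text{for all } e \in \mathcal{E}_{\mathrm{tr}}
```

The constraint is what separates this from simply pooling the training domains: the same $w$ must be optimal in every domain individually, not just on average.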

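This work addresses the problem of poor generalization in reinforcement learning by proposing Invariant Policy Optimization (IPO), a learning algorithm based on an invariance principle. The algorithm learns an invariant policy during training and shows good generalization performance on linear quadratic regulator and grid-world problems, as well as on a robot door-opening task.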

Invariant Policy Optimization: Stronger Generalization in Reinforcement Learning