Standard RL algorithms assume fixed environment dynamics and require a significant amount of interaction to adapt to new environments. We introduce Policy-Dynamics Value Functions (PD-VF), a novel approach for rapidly adapting to dynamics different from those previously seen in training. PD-VF explicitly estimates the cumulative reward in a space of policies and environments. An ensemble of conventional RL policies is used to gather experience on training environments, from which embeddings of both policies and environments can be learned. Then, a value function conditioned on both embeddings is trained. At test time, a few actions are sufficient to infer the environment embedding, enabling a policy to be selected by maximizing the learned value function (which requires no additional environment interaction). We show that our method can rapidly adapt to new dynamics on a set of MuJoCo domains. Code available at https://github.com/rraileanu/policy-dynamics-value-functions.

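This paper introduces Policy-Dynamics Value Functions, a new method for adapting quickly to environment dynamics that differ from those seen during training. Using reinforcement learning, the method learns representations of environments and policies in an embedding space and trains a value function over them. With only a small amount of interaction, the learned value function enables fast adaptation to different dynamics. Experiments show that the method performs well on MuJoCo environments.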
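The test-time selection step described above (infer an environment embedding, then pick the policy that maximizes the learned value function) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `select_policy`, the finite set of candidate policy embeddings, and the bilinear stand-in `bilinear_value` with random weights `W` are hypothetical and are not taken from the paper or its released code. A real value function would be trained on data gathered by the policy ensemble.

```python
import numpy as np


def select_policy(value_fn, policy_embeddings, env_embedding):
    """Pick the candidate policy embedding with the highest estimated value.

    No environment interaction happens here: only the (already learned)
    value function is queried, once per candidate.
    """
    scores = np.array(
        [value_fn(z_pi, env_embedding) for z_pi in policy_embeddings]
    )
    best = int(np.argmax(scores))
    return best, float(scores[best])


# Hypothetical stand-in for a trained value function: a bilinear score
# z_pi^T W z_env with random placeholder weights (an assumption, not the
# paper's model).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))


def bilinear_value(z_pi, z_env):
    return float(z_pi @ W @ z_env)


# Five candidate policy embeddings (dimension 4) and one environment
# embedding (dimension 3), as if inferred from a few test-time actions.
candidates = rng.normal(size=(5, 4))
z_env = rng.normal(size=3)

best_idx, best_val = select_policy(bilinear_value, candidates, z_env)
```

The sketch searches a finite candidate set for simplicity; maximizing over a continuous policy-embedding space would need a different optimizer.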

Fast Adaptation via Policy-Dynamics Value Functions