Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. In real-world environments, choosing the set of possible values for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a method for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, FARR jointly learns an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. Using the PSRO algorithm to find an approximate Nash equilibrium in this FARR game, we show that an agent trained with FARR is more robust to feasible adversarial parameter selection than with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments.

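This paper proposes Feasible Adversarial Robust RL (FARR), a method for automatically determining the range of environment parameters over which to be robust. By casting the problem as a two-player zero-sum game, optimizing the FARR objective yields an adversarial distribution with feasible support together with a policy that is robust over that support. In a parameterized gridworld and three MuJoCo control environments, agents trained with FARR are shown to be more robust to feasible adversarial parameter selection than agents trained with existing minimax, domain-randomization, and regret objectives.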
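The feasibility idea can be illustrated with a toy matrix game. The sketch below is not the paper's implementation: the payoff table `U`, the benchmark `lam`, and the penalty constant `C` are made-up values, a fixed table of returns stands in for trained RL policies, and plain fictitious play stands in for PSRO. It also assumes that feasibility is encoded by handing the protagonist a large payoff `C` whenever the adversary picks an infeasible parameter, which is one plausible reading of the abstract rather than a confirmed detail.

```python
# Toy sketch of a feasibility-restricted zero-sum game (all values hypothetical).
# Rows are candidate policies, columns are environment parameter values,
# U[i][j] is the protagonist's return for policy i under parameter j.
U = [
    [1.0, 0.6, 0.0],  # policy 0: strong on parameter 0
    [0.7, 0.8, 0.1],  # policy 1: strong on parameter 1
    [0.5, 0.5, 0.2],  # policy 2: cautious everywhere
]


def feasible_mask(U, lam):
    """A parameter is feasible if SOME policy reaches the benchmark return lam on it."""
    n_cols = len(U[0])
    return [max(row[j] for row in U) >= lam for j in range(n_cols)]


def farr_payoff(U, lam, C):
    """Assumed encoding: infeasible parameters pay the protagonist a large constant C,
    so a return-minimizing adversary has no incentive to select them."""
    mask = feasible_mask(U, lam)
    return [[u if ok else C for u, ok in zip(row, mask)] for row in U]


def fictitious_play(M, iters=2000):
    """Approximate equilibrium of a zero-sum matrix game.
    The row player maximizes M, the column player minimizes it.
    Returns the empirical strategies (row_mix, col_mix)."""
    n_rows, n_cols = len(M), len(M[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row, col = 0, 0  # arbitrary initial pure strategies
    for _ in range(iters):
        row_counts[row] += 1
        col_counts[col] += 1
        # Each player best-responds to the opponent's empirical play so far.
        row_vals = [sum(M[i][j] * col_counts[j] for j in range(n_cols))
                    for i in range(n_rows)]
        col_vals = [sum(M[i][j] * row_counts[i] for i in range(n_rows))
                    for j in range(n_cols)]
        row = max(range(n_rows), key=row_vals.__getitem__)
        col = min(range(n_cols), key=col_vals.__getitem__)
    return ([c / iters for c in row_counts], [c / iters for c in col_counts])


lam, C = 0.75, 10.0
mask = feasible_mask(U, lam)                 # parameter 2 is infeasible here
M = farr_payoff(U, lam, C)

plain_prot, plain_adv = fictitious_play(U)   # unrestricted minimax game
farr_prot, farr_adv = fictitious_play(M)     # feasibility-restricted game

print("feasible parameters:", mask)
print("plain minimax  protagonist / adversary:", plain_prot, plain_adv)
print("feasible game  protagonist / adversary:", farr_prot, farr_adv)
```

In this toy table the unrestricted adversary concentrates on parameter 2, where no policy can do well, and the protagonist retreats to the cautious policy 2. With the feasibility penalty, the adversary never selects parameter 2 and the protagonist mixes only over policies 0 and 1, which mirrors the "too broad makes the agent too cautious" argument in the abstract.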

Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments