We study the problem of robust reinforcement learning under adversarial
corruption on both rewards and transitions. Our attack model assumes an
\textit{adaptive} adversary who can arbitrarily corrupt the reward and
transition at every step within an episode, for at most $\epsilon$-fraction of
the learning episodes. Our attack model is strictly stronger than those
considered in prior works. Our first result shows that no algorithm can find a
better than $O(\epsilon)$-optimal policy under our attack model. Next, we show
that surprisingly the natural policy gradient (NPG) method retains a natural
robustness property if the reward corruption is bounded, and can find an
$O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy
Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and
can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the
first that can achieve a meaningful learning guarantee when a constant fraction
of episodes are corrupted. Complimentary to the theoretical results, we show
that a neural implementation of FPG achieves strong robust learning performance
on the MuJoCo continuous control benchmarks.

本文研究在奖励和转移方面存在敌对性干扰的鲁棒强化学习问题，并提出了天然策略梯度方法和筛选策略梯度算法可解决该问题，并在 MuJoCo 连续控制基准测试中取得了比较强的鲁棒性。