Recently, reinforcement learning (RL) methods that are robust to perturbations of input observations have garnered significant attention and evolved rapidly because of RL's potential vulnerability. Although these methods have achieved reasonable success, two limitations remain when the adversary is considered over long-term horizons. First, the mutual dependency between the policy and its corresponding optimal adversary limits the development of off-policy RL algorithms: because the optimal adversary must be obtained with respect to the current policy, this dependency has restricted applications to off-policy RL. Second, these methods generally assume perturbations bounded only by the $L_p$-norm, even when prior knowledge of the perturbation distribution in the environment is available. Here, we introduce another perspective on adversarial RL: an f-divergence constrained problem with the prior knowledge distribution. From this, we derive two typical attacks and their corresponding robust learning frameworks. We evaluate robustness, and the results demonstrate that our proposed methods achieve excellent performance in sample-efficient off-policy RL.

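This study addresses two limitations of current off-policy reinforcement learning algorithms, namely the mutual dependency between the policy and the adversary over long-term horizons and the assumption of $L_p$-norm-based perturbations, by proposing a new perspective: an f-divergence constrained problem based on a known distribution. From this approach, we derive two typical attacks and their corresponding robust learning frameworks, and experimental results show that the proposed methods perform excellently in terms of sample efficiency.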

Robust Off-Policy Reinforcement Learning via Soft-Constrained Adversary