Establishing robust policies is essential to counter attacks or disturbances affecting deep reinforcement learning (DRL) agents. Recent studies explore state-adversarial robustness and suggest that an optimal robust policy (ORP) may not exist, which poses challenges in setting strict robustness constraints. This work further investigates the ORP. First, we introduce a consistency assumption of policy (CAP), which states that optimal actions in the Markov decision process remain consistent under minor perturbations; the assumption is supported by empirical and theoretical evidence. Building upon CAP, we prove the existence of a deterministic and stationary ORP that aligns with the Bellman optimal policy. Furthermore, we illustrate the necessity of the $L^{\infty}$-norm when minimizing the Bellman error to attain the ORP. This finding clarifies the vulnerability of prior DRL algorithms that target the Bellman optimal policy with the $L^{1}$-norm and motivates us to train a Consistent Adversarial Robust Deep Q-Network (CAR-DQN) by minimizing a surrogate of the Bellman Infinity-error. The top-tier performance of CAR-DQN across various benchmarks validates its practical effectiveness and reinforces the soundness of our theoretical analysis.
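
The contrast between the $L^{1}$-norm and the $L^{\infty}$-norm of the Bellman error can be illustrated with a small tabular sketch. This is not the CAR-DQN objective, which the abstract does not spell out. The toy MDP, the function names (`bellman_residual`, `l1_error`, `linf_error`, `soft_linf_error`), and the log-sum-exp smoothing are all assumptions made here for illustration. The point is only that a Q-table can have a small average Bellman error while being badly wrong at a single state-action pair, which an averaged error hides and a maximum-based error exposes.

```python
import numpy as np

def bellman_residual(Q, P, R, gamma):
    """Absolute Bellman optimality residual |TQ - Q| for a tabular MDP.

    Q: (S, A) action values, P: (S, A, S) transition probabilities,
    R: (S, A) rewards, gamma: discount factor.
    """
    v_next = Q.max(axis=1)            # greedy value of each state, shape (S,)
    target = R + gamma * (P @ v_next) # Bellman optimality backup, shape (S, A)
    return np.abs(target - Q)

def l1_error(residual):
    # Average absolute residual (L1-style objective under a uniform distribution).
    return residual.mean()

def linf_error(residual):
    # Worst-case residual over all state-action pairs (L-infinity norm).
    return residual.max()

def soft_linf_error(residual, temperature=0.1):
    # Hypothetical smooth surrogate: temperature-scaled log-sum-exp, which
    # upper-bounds the max by at most temperature * log(number of entries).
    z = residual.ravel() / temperature
    m = z.max()                       # subtract the max for numerical stability
    return temperature * (m + np.log(np.exp(z - m).sum()))

# Toy MDP (assumed): zero rewards, every transition leads to the last state,
# so Q = 0 is the exact fixed point of the Bellman optimality operator.
S, A, gamma = 100, 2, 0.9
P = np.zeros((S, A, S))
P[:, :, S - 1] = 1.0
R = np.zeros((S, A))

Q = np.zeros((S, A))
Q[0, 0] = 1.0                         # corrupt a single state-action value

residual = bellman_residual(Q, P, R, gamma)
print("L1 error:        ", l1_error(residual))
print("L-infinity error:", linf_error(residual))
print("soft surrogate:  ", soft_linf_error(residual))
```

In this toy setting the averaged error shrinks as the state space grows, while the maximum stays at the size of the single corrupted entry. The smooth surrogate is one differentiable stand-in for a maximum; the surrogate actually used for CAR-DQN may differ.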


Towards Optimal Adversarial Robust Q-learning with Bellman Infinity-error