Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We then generalize regularized MDPs to twice regularized MDPs ($\text{R}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable us to derive planning and learning schemes with convergence and generalization guarantees, thus reducing robustness to regularization. We numerically show this two-fold advantage on tabular and physical domains, highlighting the fact that $\text{R}^2$ preserves its efficacy in continuous environments.
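The link between reward robustness and policy regularization can be checked numerically at a single state. The following is a minimal sketch, not the paper's construction: it assumes the reward uncertainty set is an L2 ball of radius `alpha` around a nominal reward, and the function names, policy, rewards, and radius are all hypothetical. Under that assumption, Cauchy-Schwarz gives a worst-case expected reward equal to the nominal expected reward minus `alpha` times the L2 norm of the policy, which is a policy-regularized objective.

```python
import math
import random

def regularized_value(policy, reward, alpha):
    """<pi, r0> - alpha * ||pi||_2: expected reward minus a policy-norm penalty."""
    expected = sum(p * r for p, r in zip(policy, reward))
    norm = math.sqrt(sum(p * p for p in policy))
    return expected - alpha * norm

def worst_case_value(policy, reward, alpha):
    """Minimum of <pi, r0 + delta> over the ball ||delta||_2 <= alpha.

    By Cauchy-Schwarz the minimizer is delta = -alpha * pi / ||pi||_2.
    """
    norm = math.sqrt(sum(p * p for p in policy))
    delta = [-alpha * p / norm for p in policy]
    return sum(p * (r + d) for p, r, d in zip(policy, reward, delta))

def sampled_value(policy, reward, alpha, rng):
    """Expected reward under one random perturbation drawn from inside the ball."""
    direction = [rng.gauss(0.0, 1.0) for _ in policy]
    scale = alpha * rng.random() / math.sqrt(sum(d * d for d in direction))
    return sum(p * (r + scale * d) for p, r, d in zip(policy, reward, direction))

# Hypothetical single-state example (assumed values, not from the paper).
policy = [0.5, 0.3, 0.2]    # action distribution at one state
reward = [1.0, 0.0, -1.0]   # nominal reward r0 for each action
alpha = 0.4                 # radius of the reward uncertainty ball

robust = worst_case_value(policy, reward, alpha)
regularized = regularized_value(policy, reward, alpha)

# No perturbation sampled inside the ball should do worse than the robust value.
rng = random.Random(0)
samples = [sampled_value(policy, reward, alpha, rng) for _ in range(1000)]
```

Here `robust` and `regularized` agree up to floating-point error, and every entry of `samples` is at least as large as `robust`. Other choices of uncertainty set would lead to a different penalty term.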

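This work aims to learn Markov decision processes (MDPs) with robustness properties. By analyzing regularized MDPs, we establish a connection between reward-robust MDPs and regularized MDPs, and extend this relationship to MDPs with uncertain transitions. We further generalize regularized MDPs to twice regularized MDPs and validate the approach numerically on tabular and physical domains.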

Twice Regularized Markov Decision Processes: The Equivalence Between Robustness and Regularization