This paper studies reinforcement learning for the regularized robust Markov decision process (MDP) problem, an extension of the robust MDP framework. We first introduce the risk-sensitive MDP and establish its equivalence with the regularized robust MDP. This equivalence offers an alternative perspective on the regularized robust MDP and enables the design of efficient learning algorithms. Building on it, we derive the policy gradient theorem for the regularized robust MDP problem and prove global convergence of the exact policy gradient method in the tabular setting with direct parameterization. We also propose a sample-based offline learning algorithm, robust fitted-Z iteration (RFZI), for a specific regularized robust MDP problem with a KL-divergence regularization term, and we analyze its sample complexity. Numerical simulations support our results.
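As a small illustration of the kind of equivalence described above, the sketch below checks the standard Gibbs (Donsker-Varadhan) variational identity for a single transition step: minimizing expected value plus a KL penalty over alternative next-state distributions gives the entropic risk value. This is only a sketch under stated assumptions, not the paper's formulation: the sign convention (an adversary minimizing a reward-type value), the parameter name `beta`, and all function names are hypothetical choices made for this example.

```python
import math


def entropic_risk(p, v, beta):
    """Risk-sensitive value -(1/beta) * log E_p[exp(-beta * V)].

    Assumption: V is a reward-type value and beta > 0 is the risk parameter.
    """
    m = min(v)  # shift by the minimum for numerical stability
    s = sum(pi * math.exp(-beta * (vi - m)) for pi, vi in zip(p, v))
    return m - math.log(s) / beta


def worst_case_distribution(p, v, beta):
    """Minimizer of E_q[V] + (1/beta) * KL(q || p): q_i proportional to p_i * exp(-beta * v_i)."""
    m = min(v)
    w = [pi * math.exp(-beta * (vi - m)) for pi, vi in zip(p, v)]
    z = sum(w)
    return [wi / z for wi in w]


def regularized_objective(q, p, v, beta):
    """KL-regularized robust objective E_q[V] + (1/beta) * KL(q || p)."""
    mean = sum(qi * vi for qi, vi in zip(q, v))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return mean + kl / beta


if __name__ == "__main__":
    # Hypothetical nominal next-state distribution and next-state values.
    p = [0.5, 0.3, 0.2]
    v = [1.0, 0.0, 2.0]
    beta = 2.0

    q_star = worst_case_distribution(p, v, beta)
    # The two printed numbers should agree up to floating-point error.
    print(regularized_objective(q_star, p, v, beta))
    print(entropic_risk(p, v, beta))
```

Evaluating the regularized objective at any other distribution, for example at the nominal `p` itself, should give a value no smaller than the entropic risk, which is what makes the latter the regularized worst case in this one-step setting.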

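The paper studies reinforcement learning for robust Markov decision problems. It establishes an equivalence between risk-sensitive MDPs and regularized robust MDPs, derives a policy gradient theorem for the regularized robust MDP problem, proposes RFZI, a sample-based offline learning algorithm for solving regularized robust MDP problems, and analyzes the sample complexity of that algorithm.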

Regularized Robust MDPs and Risk-Sensitive MDPs: Equivalence, Policy Gradient, and Sample Complexity