In this paper, we study the non-asymptotic performance of the optimal robust policy, measured by its robust value function under the true transition dynamics. The optimal robust policy is computed from a generative model or an offline dataset, without access to the true transition dynamics. In particular, we consider three uncertainty sets, the $L_1$, $\chi^2$, and KL balls, under both the $(s,a)$-rectangular and $s$-rectangular assumptions. Our results show that under the $(s,a)$-rectangular assumption, the sample complexity is about $\widetilde{O}\left(\frac{|\mathcal{S}|^2|\mathcal{A}|}{\varepsilon^2\rho^2(1-\gamma)^4}\right)$ in the generative model setting and $\widetilde{O}\left(\frac{|\mathcal{S}|}{\nu_{\min}\varepsilon^2\rho^2(1-\gamma)^4}\right)$ in the offline dataset setting. While prior work on non-asymptotic performance is restricted to the KL ball and the $(s,a)$-rectangular assumption, we also extend our results to the more general $s$-rectangular assumption, which leads to a larger sample complexity than the $(s,a)$-rectangular assumption.
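
To make the scaling of the two $(s,a)$-rectangular bounds concrete, the following minimal sketch evaluates only their leading-order terms. It drops the constants and logarithmic factors hidden by $\widetilde{O}$, so the outputs are useful for comparing settings, not as actual sample counts. The function names are hypothetical, and `nu_min` is assumed here to be the smallest positive probability that the offline data distribution assigns to a state-action pair; the abstract does not define it.

```python
def generative_rate(num_states, num_actions, eps, rho, gamma):
    """Leading-order term |S|^2 |A| / (eps^2 rho^2 (1 - gamma)^4).

    Constants and log factors hidden by the O-tilde notation are dropped,
    so this is a scaling illustration, not a sample count.
    """
    return num_states**2 * num_actions / (eps**2 * rho**2 * (1.0 - gamma) ** 4)


def offline_rate(num_states, nu_min, eps, rho, gamma):
    """Leading-order term |S| / (nu_min eps^2 rho^2 (1 - gamma)^4).

    nu_min is ASSUMED to be the smallest positive probability of a
    state-action pair under the offline data distribution.
    """
    return num_states / (nu_min * eps**2 * rho**2 * (1.0 - gamma) ** 4)


if __name__ == "__main__":
    S, A, eps, rho, gamma = 10, 5, 0.1, 0.5, 0.9
    print(generative_rate(S, A, eps, rho, gamma))
    # With uniform data coverage, nu_min = 1 / (|S| |A|).
    print(offline_rate(S, 1.0 / (S * A), eps, rho, gamma))
```

Under these assumptions, uniform coverage with $\nu_{\min} = 1/(|\mathcal{S}||\mathcal{A}|)$ makes the two leading terms coincide, since $|\mathcal{S}|/\nu_{\min} = |\mathcal{S}|^2|\mathcal{A}|$. Halving $\varepsilon$ quadruples either term.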

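This paper studies the non-asymptotic and asymptotic performance of the optimal robust policy and value function in robust Markov decision processes under different uncertainty sets. Experiments verify that the optimal robust value function exhibits asymptotic normality at the typical $\sqrt{n}$ rate, both in theory and in practical applications.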

A Theoretical Study of Robust Markov Decision Processes: Sample Complexity and Asymptotics