In this work, we analyze the sample complexity of model-free reinforcement learning with a generative model. In particular, we study mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that MDVI is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small. This is the first theoretical result demonstrating that a simple model-free algorithm without variance reduction can be nearly minimax-optimal in the considered setting.
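To make the KL- and entropy-regularized updates concrete, the following is a minimal tabular sketch in Python, not the authors' implementation. It assumes a common parameterization of MDVI with a KL coefficient `tau` and an entropy coefficient `kappa`, under which the regularized greedy policy is proportional to $\pi_{k}^{\alpha} \exp(\beta q_k)$ with $\alpha = \tau/(\tau+\kappa)$ and $\beta = 1/(\tau+\kappa)$. The function name `mdvi_sketch`, its arguments, and the use of a known transition tensor `P` to simulate the generative model are all hypothetical choices made for illustration.

```python
import numpy as np


def mdvi_sketch(P, r, gamma=0.9, tau=1.0, kappa=0.1,
                num_iters=200, samples_per_pair=5, seed=0):
    """Illustrative KL- and entropy-regularized value iteration (tabular).

    P: (S, A, S) transition probabilities; used only to simulate a
       generative model that returns sampled next states (assumption).
    r: (S, A) reward table.
    Returns the final q-table and the final policy, both of shape (S, A).
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    alpha = tau / (tau + kappa)   # weight on the previous policy
    beta = 1.0 / (tau + kappa)    # inverse temperature on q
    q = np.zeros((S, A))
    log_pi = np.full((S, A), -np.log(A))  # start from the uniform policy

    for _ in range(num_iters):
        # Regularized greedy step: new policy ∝ pi^alpha * exp(beta * q).
        logits = alpha * log_pi + beta * q
        m = logits.max(axis=1, keepdims=True)
        log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
        new_log_pi = logits - log_z

        # Closed form of the regularized value: v = (tau + kappa) * log Z.
        v = (tau + kappa) * log_z[:, 0]

        # Query the generative model for every state-action pair and
        # average the value of the sampled next states.
        next_v = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                nxt = rng.choice(S, size=samples_per_pair, p=P[s, a])
                next_v[s, a] = v[nxt].mean()

        # Plain sampled Bellman backup, with no variance-reduction step.
        q = r + gamma * next_v
        log_pi = new_log_pi

    return q, np.exp(log_pi)
```

Calling `mdvi_sketch(P, r)` on a small MDP returns a q-table and a stochastic policy whose rows sum to one. The sketch only illustrates the shape of the updates; it does not reproduce the sample-size or iteration-count choices that the sample-complexity analysis relies on.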

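This paper studies the sample complexity of model-free reinforcement learning with a generative model. It focuses on mirror descent value iteration (MDVI) of Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates, and proves that the algorithm is nearly minimax-optimal when $\varepsilon$ is sufficiently small. This is the first theoretical result showing that a simple model-free algorithm, one that performs no variance reduction, is nearly minimax-optimal in the setting considered.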

KL-Entropy-Regularized Reinforcement Learning with a Generative Model is Minimax Optimal