In this work, we study the sample complexity of model-free
reinforcement learning with a generative model. In particular, we analyze mirror
descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al.
(2020a), which uses the Kullback-Leibler divergence and entropy regularization
in its value and policy updates. Our analysis shows that it is nearly
minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$
is sufficiently small. This is the first theoretical result that demonstrates
that a simple model-free algorithm without variance reduction can be nearly
minimax-optimal under the considered setting.
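
To make the KL and entropy regularization concrete, the following is a minimal tabular sketch of a value-iteration update of this kind, not the exact algorithm analyzed in the paper. It assumes the greedy step maximizes $\langle \pi, q_k \rangle - \tau\,\mathrm{KL}(\pi \| \pi_k) + \kappa\,\mathcal{H}(\pi)$, whose closed-form maximizer is $\pi_{k+1} \propto \pi_k^{\tau/(\tau+\kappa)} \exp(q_k/(\tau+\kappa))$. The function name `mdvi_tabular`, the parameter names `tau`, `kappa`, `n_samples`, and the toy MDP are all illustrative assumptions; the generative model is simulated by sampling next states from a known transition tensor.

```python
import numpy as np

def mdvi_tabular(P, r, gamma=0.9, tau=1.0, kappa=0.1,
                 iters=200, n_samples=4, seed=0):
    """Hypothetical sketch of a KL-entropy-regularized value iteration.

    P: (S, A, S) transition probabilities, used only as a sampler
       (stand-in for a generative model).
    r: (S, A) rewards.
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    q = np.zeros((S, A))
    log_pi = np.full((S, A), -np.log(A))  # uniform initial policy
    beta = tau + kappa
    for _ in range(iters):
        # Closed-form maximizer of <pi, q> - tau*KL(pi || pi_k) + kappa*H(pi).
        logits = (q + tau * log_pi) / beta
        m = logits.max(axis=1, keepdims=True)
        lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
        log_pi = logits - lse
        # Value of the regularized maximization at each state.
        v = beta * lse[:, 0]
        # Sample-based Bellman backup: draw next states for every (s, a).
        q_next = np.empty_like(q)
        for s in range(S):
            for a in range(A):
                nxt = rng.choice(S, size=n_samples, p=P[s, a])
                q_next[s, a] = r[s, a] + gamma * v[nxt].mean()
        q = q_next
    return np.exp(log_pi), q

# Toy MDP (assumed for illustration): action 0 pays reward 1 and leads
# to state 0; action 1 pays nothing and leads to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
r = np.array([[1.0, 0.0], [1.0, 0.0]])
pi, q = mdvi_tabular(P, r)
```

The KL term makes each policy a geometric mixture of the previous policy and a softmax of the current value estimate, so it changes gradually from one iteration to the next. In this sketch no explicit variance-reduction step is used; each backup relies only on fresh samples.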

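This paper studies the sample complexity of model-free reinforcement learning with a generative model. It focuses on mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates, and shows that the algorithm is nearly minimax-optimal when $\varepsilon$ is sufficiently small. This is the first theoretical result showing that a simple model-free algorithm without variance reduction can be nearly minimax-optimal in the considered setting.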

KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

We revisit the incremental autonomous exploration problem proposed by Lim &
Auer (2012). In this setting, the agent aims to learn a set of near-optimal
goal-conditioned policies to reach the $L$-controllable states: states that are
incrementally reachable from an initial state $s_0$ within $L$ steps in
expectation. We introduce a new algorithm with stronger sample complexity
bounds than existing ones. We also prove the first lower bound for
the autonomous exploration problem. In particular, the lower bound implies that
our proposed algorithm, Value-Aware Autonomous Exploration, is nearly
minimax-optimal when the number of $L$-controllable states grows polynomially
with respect to $L$. Key to our algorithm design is a connection between
autonomous exploration and multi-goal stochastic shortest path, a new problem
that naturally generalizes the classical stochastic shortest path problem. This
new problem and its connection to autonomous exploration can be of independent
interest.

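This study revisits the incremental autonomous exploration problem proposed by Lim & Auer (2012), introduces a new algorithm, and proves that it is nearly minimax-optimal when the number of $L$-controllable states grows polynomially with respect to $L$.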