We study the sample complexity of learning an $\varepsilon$-optimal policy in an average-reward Markov decision process (MDP) under a generative model. We establish the complexity bound $\widetilde{O}\left(SA\frac{H}{\varepsilon^2} \right)$, where $H$ is the span of the bias function of the optimal policy and $SA$ is the cardinality of the state-action space. Our result is the first that is minimax optimal (up to log factors) in all parameters $S,A,H$ and $\varepsilon$, improving on existing work that either assumes uniformly bounded mixing times for all policies or has suboptimal dependence on the parameters. Our result is based on reducing the average-reward MDP to a discounted MDP. To establish the optimality of this reduction, we develop improved bounds for $\gamma$-discounted MDPs, showing that $\widetilde{O}\left(SA\frac{H}{(1-\gamma)^2\varepsilon^2} \right)$ samples suffice to learn a $\varepsilon$-optimal policy in weakly communicating MDPs under the regime that $\gamma \geq 1 - \frac{1}{H}$, circumventing the well-known lower bound of $\widetilde{\Omega}\left(SA\frac{1}{(1-\gamma)^3\varepsilon^2} \right)$ for general $\gamma$-discounted MDPs. Our analysis develops upper bounds on certain instance-dependent variance parameters in terms of the span parameter. These bounds are tighter than those based on the mixing time or diameter of the MDP and may be of broader use.

我们研究了一个基于生成模型的平均回报马尔科夫决策过程（MDP）中学习一个ε-最优策略的样本复杂度，建立了复杂度界限Ω(SA(H/ε^2))。我们的结果在参数S、A、H和ε上是极小极大最优的（最多有对数系数），进一步改进了现有工作，其中要么假定所有策略的混合时间均匀有界，要么对参数有次优的依赖。我们的结果基于将平均回报MDP简化为折扣MDP。为了证明这种简化的最优性，我们对γ折扣MDP进行了改进的界限，显示了在γ≥1-1/H的情况下，采样Ω(SA(H/((1-γ)^2ε^2)))足以在弱通信MDP中学习ε-最优策略，从而规避了适用于一般γ折扣MDP的Ω(SA/(1-γ)^3ε^2)的已知下限。我们的分析以跨度参数为基础，对某些实例相关方差参数进行了上界估计，这些上界比基于MDP的混合时间或直径的估计更紧凑，可能具有更广泛的应用。

基于区间的平均奖励MDP的最优样本复杂度