Several recent works have proposed instance-dependent upper bounds on the number of episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in finite-horizon tabular Markov Decision Processes (MDPs). These upper bounds feature various complexity measures for the MDP, which are defined based on different notions of sub-optimality gaps. However, as of now, no lower bound has been established to assess the optimality of any of these complexity measures, except for the special case of MDPs with deterministic transitions. In this paper, we propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic MDP. Additionally, we demonstrate that the sample complexity of the PEDEL algorithm of \cite{Wagenmaker22linearMDP} closely approaches this lower bound. Considering the intractability of PEDEL, we formulate an open question regarding the possibility of achieving our lower bound using a computationally efficient algorithm.

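This paper proposes the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic Markov Decision Process (MDP), and shows that the sample complexity of the PEDEL algorithm is close to this lower bound. Given the computational intractability of PEDEL, we pose an open question on whether our lower bound can be achieved by a computationally efficient algorithm.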

Towards Instance-Optimality in Online PAC Reinforcement Learning