Reinforcement learning lacks a principled measure of optimality, causing research to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality. Focusing on finite state and action Markov decision processes (MDPs), we develop a simple, computable gap function that provides both upper and lower bounds on the optimality gap. Therefore, convergence of the gap function is a stronger mode of convergence than convergence of the optimality gap, and it is equivalent to a new notion we call distribution-free convergence, where convergence is independent of any problem-dependent distribution. We show that basic policy mirror descent exhibits fast distribution-free convergence in both the deterministic and stochastic settings. We leverage distribution-free convergence to uncover a couple of new results. First, deterministic policy mirror descent can solve unregularized MDPs in strongly-polynomial time. Second, accuracy estimates can be obtained with no additional samples while running stochastic policy mirror descent and can be used as a termination criterion, which can be verified in the validation step.
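
To make the idea of a computable gap function concrete, the following is a minimal sketch and not the construction from the paper, which this abstract does not spell out. It assumes a reward-maximization convention and takes the gap to be the largest advantage, g(π) = max over (s, a) of Q^π(s, a) − V^π(s). For this quantity the standard performance difference lemma gives g(π) ≤ max_s [V*(s) − V^π(s)] ≤ g(π) / (1 − γ). The two-state MDP, the function names, and the step size are all hypothetical, and the update shown is the familiar KL-based (multiplicative-weights) form of policy mirror descent.

```python
import math

# Hypothetical 2-state, 2-action MDP (reward maximization; all numbers are made up).
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
# P[s][a][t] = probability of moving from state s to state t under action a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
# R[s][a] = immediate reward for taking action a in state s.
R = [[1.0, 0.0],
     [0.0, 2.0]]


def q_from_v(v):
    """One-step lookahead: Q(s, a) = R(s, a) + gamma * sum_t P(t | s, a) * v(t)."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))
             for a in range(N_ACTIONS)] for s in range(N_STATES)]


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation; returns an accurate approximation of V^pi."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        q = q_from_v(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return v


def optimal_values(sweeps=2000):
    """Value iteration; used here only to check the bounds, not by the gap itself."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [max(row) for row in q_from_v(v)]
    return v


def advantage_gap(policy):
    """Assumed gap function: the largest advantage max_{s,a} Q^pi(s,a) - V^pi(s)."""
    v = evaluate(policy)
    q = q_from_v(v)
    return max(q[s][a] - v[s] for s in range(N_STATES) for a in range(N_ACTIONS))


def optimality_gap(policy):
    """True worst-state optimality gap max_s V*(s) - V^pi(s)."""
    v, v_star = evaluate(policy), optimal_values()
    return max(v_star[s] - v[s] for s in range(N_STATES))


def pmd_step(policy, step_size=1.0):
    """KL-based mirror descent step: pi_new(a|s) is proportional to pi(a|s) * exp(eta * Q(s,a))."""
    q = q_from_v(evaluate(policy))
    new_policy = []
    for s in range(N_STATES):
        top = max(q[s])  # subtract the max for numerical stability
        weights = [policy[s][a] * math.exp(step_size * (q[s][a] - top))
                   for a in range(N_ACTIONS)]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


def run_pmd(policy, steps):
    """Apply `steps` mirror descent updates and return the final policy."""
    for _ in range(steps):
        policy = pmd_step(policy)
    return policy
```

Starting from the uniform policy `[[0.5, 0.5], [0.5, 0.5]]`, calling `advantage_gap` on the iterates of `run_pmd` needs only quantities of the current policy, whereas `optimality_gap` needs the optimal values. That difference is what lets a gap of this kind serve as a stopping test.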

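This work addresses the lack of a principled measure of optimality in reinforcement learning by developing a simple, computable gap function that provides upper and lower bounds on the optimality gap. It shows that basic policy mirror descent exhibits fast distribution-free convergence in both the deterministic and stochastic settings. This new result makes it possible to solve unregularized Markov decision processes in strongly-polynomial time and to obtain accuracy estimates with no additional samples while running stochastic policy mirror descent.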

Strongly-Polynomial Time and Validation Analysis of Policy Gradient Methods