We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP). While existing analyses of common approaches, such as fitted $Q$-iteration (FQI), suggest an $O(1/\sqrt{n})$ convergence rate for regret, empirical behavior exhibits much faster convergence. In this paper, we present a finer regret analysis that exactly characterizes this phenomenon by providing fast rates for the regret convergence. First, we show that given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q^*$-estimate's pointwise convergence rate, thus speeding it up. The level of exponentiation depends on the level of noise in the decision-making problem, rather than in the estimation problem. We establish such noise levels for linear and tabular MDPs as examples. Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees. As specific cases, our results imply $O(1/n)$ regret rates in linear cases and $\exp(-\Omega(n))$ regret rates in tabular cases.
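
As a rough illustration of the objects in the abstract above, the following minimal sketch runs tabular FQI on offline data and reads off the greedy policy defined by the resulting $Q^*$-estimate. Everything in it is an assumption made for illustration and not the paper's setup: the two-state, two-action toy MDP (reward mean equal to the action index, uniformly random next state), the uniform behavior policy, the Gaussian reward noise, the sample size, and the function names `sample_offline_data`, `fitted_q_iteration`, and `greedy_policy`.

```python
import random


def sample_offline_data(n, rng):
    """Draw n transitions (s, a, r, s_next) from a hypothetical toy MDP.

    Assumed toy dynamics: the reward has mean equal to the action index
    (0 or 1) plus Gaussian noise, and the next state is uniformly random.
    """
    data = []
    for _ in range(n):
        s = rng.randrange(2)
        a = rng.randrange(2)  # fixed behavior policy: uniform over actions
        r = float(a) + rng.gauss(0.0, 0.5)
        s_next = rng.randrange(2)
        data.append((s, a, r, s_next))
    return data


def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.9, n_iters=200):
    """Tabular FQI: repeatedly regress Bellman targets onto (s, a) cells."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        sums = [[0.0] * n_actions for _ in range(n_states)]
        counts = [[0] * n_actions for _ in range(n_states)]
        for s, a, r, s_next in transitions:
            # Bellman target built from the previous iterate.
            sums[s][a] += r + gamma * max(q[s_next])
            counts[s][a] += 1
        # In the tabular class, the least-squares fit is the per-cell mean.
        # Cells never visited in the data keep their previous value.
        q = [
            [sums[s][a] / counts[s][a] if counts[s][a] else q[s][a]
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


def greedy_policy(q):
    """Policy defined by a Q-estimate: pick the argmax action in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


rng = random.Random(0)
data = sample_offline_data(2000, rng)
q_hat = fitted_q_iteration(data, n_states=2, n_actions=2)
pi_hat = greedy_policy(q_hat)
print("greedy policy:", pi_hat)
print("estimated action gaps:", [row[1] - row[0] for row in q_hat])
```

In this toy problem the gap between the best and second-best action is 1 in every state, so the greedy policy is exactly optimal (zero regret) as soon as the pointwise error of the $Q^*$-estimate is below half that gap, even though the estimate itself is still noisy. This is only meant to convey the intuition behind regret converging faster than the $Q^*$-estimate; it does not reproduce the paper's analysis.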

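This paper studies the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process. It analyzes the regret convergence rate of common methods such as fitted $Q$-iteration (FQI) and provides faster convergence rates. In particular, given any estimate of the optimal quality function $Q^*$, the regret of the policy it defines converges at a rate given by exponentiating the estimate's pointwise convergence rate, which speeds it up. The paper also establishes the corresponding noise levels for linear and tabular MDPs.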

Fast Rates for the Regret of Offline Reinforcement Learning