Stochastic Approximation (SA) is a widely used algorithmic approach in
various fields, including optimization and reinforcement learning (RL). Among
RL algorithms, Q-learning is particularly popular due to its empirical success.
In this paper, we study asynchronous Q-learning with constant stepsize, which
is commonly used in practice for its fast convergence. By connecting the
constant stepsize Q-learning to a time-homogeneous Markov chain, we show the
distributional convergence of the iterates in Wasserstein distance and
establish its exponential convergence rate. We also establish a Central Limit
Theorem for Q-learning iterates, demonstrating the asymptotic normality of the
averaged iterates. Moreover, we provide an explicit expansion of the asymptotic
bias of the averaged iterate in the stepsize. Specifically, the bias is
proportional to the stepsize up to higher-order terms, and we provide an
explicit expression for the linear coefficient. This precise characterization
of the bias allows the application of the Richardson-Romberg (RR) extrapolation
technique to construct a new estimate that is provably closer to the optimal Q
function. Numerical results corroborate our theoretical findings on the
improvement achieved by RR extrapolation.
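
The pipeline described above (asynchronous constant-stepsize Q-learning, iterate averaging, then RR extrapolation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's experimental setup: the random MDP, the uniform behavior policy, the burn-in length, the use of independent seeds for the two runs, and all function names (`make_mdp`, `optimal_q`, `averaged_q_learning`, `rr_extrapolate`) are hypothetical. It uses the standard two-stepsize RR combination: if the averaged iterate has bias of the form (stepsize × B) plus higher-order terms, then twice the average at stepsize alpha minus the average at stepsize 2·alpha cancels the linear term.

```python
import numpy as np


def make_mdp(n_states=3, n_actions=2, seed=0):
    """Hypothetical random MDP: P[s, a] is a next-state distribution, R[s, a] in [0, 1]."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return P, R


def optimal_q(P, R, gamma, iters=2000):
    """Reference Q* by value iteration (uses the model; for comparison only)."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q


def averaged_q_learning(P, R, gamma, alpha, n_steps, burn_in, seed=0):
    """Asynchronous Q-learning with constant stepsize alpha along one trajectory.

    Only the visited (state, action) entry is updated at each step. Returns the
    average of the iterates after a burn-in period (an assumed averaging scheme).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    cdf = np.cumsum(P, axis=-1)                       # for inverse-CDF sampling
    actions = rng.integers(n_actions, size=n_steps)   # uniform behavior policy
    uniforms = rng.random(n_steps)
    Q = np.zeros((n_states, n_actions))
    Q_sum = np.zeros_like(Q)
    s = 0
    for k in range(n_steps):
        a = actions[k]
        # Sample the next state; min() guards against cumulative-sum round-off.
        s_next = min(int(np.searchsorted(cdf[s, a], uniforms[k])), n_states - 1)
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        if k >= burn_in:
            Q_sum += Q
        s = s_next
    return Q_sum / (n_steps - burn_in)


def rr_extrapolate(P, R, gamma, alpha, n_steps, burn_in, seed=0):
    """Combine averaged iterates at stepsizes alpha and 2 * alpha.

    Assumes bias = alpha * B + higher-order terms, so the linear term cancels.
    """
    q_alpha = averaged_q_learning(P, R, gamma, alpha, n_steps, burn_in, seed)
    q_2alpha = averaged_q_learning(P, R, gamma, 2 * alpha, n_steps, burn_in, seed + 1)
    return 2.0 * q_alpha - q_2alpha, q_alpha, q_2alpha


if __name__ == "__main__":
    P, R = make_mdp()
    gamma = 0.9
    q_star = optimal_q(P, R, gamma)
    q_rr, q_a, q_2a = rr_extrapolate(P, R, gamma, alpha=0.05,
                                     n_steps=100_000, burn_in=20_000)
    for name, q in [("alpha", q_a), ("2*alpha", q_2a), ("RR", q_rr)]:
        print(name, "max abs error:", np.abs(q - q_star).max())
```

The printed errors depend on the random MDP and on the run length, so no particular values are claimed here. Whether the extrapolated estimate comes out closer to Q* in a given finite run also depends on the variance of the two averages, which the extrapolation tends to amplify.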

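By connecting constant stepsize Q-learning to a time-homogeneous Markov chain, the paper shows the distributional convergence of the iterates in Wasserstein distance and establishes its exponential convergence rate. It also establishes a Central Limit Theorem for Q-learning iterates, proving the asymptotic normality of the averaged iterates. In addition, it provides an explicit expansion of the asymptotic bias in the stepsize: the bias is proportional to the stepsize, with an explicit expression for the linear coefficient. This precise characterization of the bias allows the Richardson-Romberg extrapolation technique to be applied to construct a new estimate that is provably closer to the optimal Q function. Numerical results corroborate the theoretical findings on the improvement achieved by RR extrapolation.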

Constant Stepsize Q-learning: Distributional Convergence, Bias and Extrapolation

A very simple unidimensional function with Lipschitz continuous gradient is
constructed such that the ADAM algorithm with constant stepsize, started from
the origin, diverges when applied to minimize this function in the absence of
noise on the gradient. Divergence occurs irrespective of the choice of the
method parameters.
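
For reference, the noiseless setting described above can be written out using the commonly stated form of the ADAM iteration. This is a sketch under assumptions: the abstract does not say which variant is analyzed, so the bias correction, the safeguard $\epsilon$, and the symbols $\alpha$, $\beta_1$, $\beta_2$, $m_k$, $v_k$ below are taken from the usual textbook presentation rather than from the paper. The constructed function $f$ itself is not given in the abstract and is not reproduced here.

```latex
% Assumed standard ADAM recursion in one dimension, exact gradients g_k = f'(x_k).
\begin{align*}
  x_0 &= 0, \qquad m_0 = 0, \qquad v_0 = 0, \\
  g_k &= f'(x_k), \\
  m_{k+1} &= \beta_1 m_k + (1 - \beta_1)\, g_k, \\
  v_{k+1} &= \beta_2 v_k + (1 - \beta_2)\, g_k^2, \\
  \hat{m}_{k+1} &= \frac{m_{k+1}}{1 - \beta_1^{\,k+1}}, \qquad
  \hat{v}_{k+1} = \frac{v_{k+1}}{1 - \beta_2^{\,k+1}}, \\
  x_{k+1} &= x_k - \alpha\, \frac{\hat{m}_{k+1}}{\sqrt{\hat{v}_{k+1}} + \epsilon}.
\end{align*}
```

In this notation, the method parameters referred to above would be the constant stepsize $\alpha$ and the averaging parameters $\beta_1, \beta_2 \in [0, 1)$.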

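A very simple one-dimensional function with Lipschitz continuous gradient is constructed such that, in the absence of gradient noise, the ADAM algorithm started from the origin diverges when applied to minimize it, irrespective of the choice of the method parameters.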