We prove performance guarantees of two algorithms for approximating $Q^\star$ in batch reinforcement learning. Compared to classical iterative methods such as Fitted Q-Iteration---whose performance loss incurs quadratic dependence on horizon---these methods estimate (some forms of) the Bellman error and enjoy linear-in-horizon error propagation, a property established for the first time for algorithms that rely solely on batch data and output stationary policies. One of the algorithms uses a novel and explicit importance-weighting correction to overcome the infamous "double sampling" difficulty in Bellman error estimation, and does not use any squared losses. Our analyses reveal its distinct characteristics and potential advantages compared to classical algorithms.
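
To make the "double sampling" difficulty concrete, here is a minimal sketch on a toy problem. All numbers and function names are illustrative assumptions and do not come from the paper. A single state-action pair has a stochastic next state, and the candidate $Q$ equals its own Bellman backup, so its true squared Bellman error is zero. The naive squared temporal-difference loss, computed from one sampled next state per transition, still has a strictly positive expectation, because it adds the variance of the sampled backup target.

```python
# Toy, hypothetical setup: one (s, a) pair, two equally likely next states.
GAMMA = 1.0                 # discount factor (assumed for illustration)
REWARD = 0.0                # deterministic reward (assumed)
NEXT_VALUES = [0.0, 1.0]    # V_Q(s') = max_a' Q(s', a') for each next state
NEXT_PROBS = [0.5, 0.5]     # transition probabilities


def bellman_backup(values, probs, reward=REWARD, gamma=GAMMA):
    """Exact backup (TQ)(s, a) = r + gamma * E[V_Q(s')]."""
    return reward + gamma * sum(p * v for p, v in zip(probs, values))


def true_squared_bellman_error(q, values, probs):
    """(Q(s, a) - (TQ)(s, a))^2, which needs the full expectation over s'."""
    return (q - bellman_backup(values, probs)) ** 2


def expected_naive_squared_td(q, values, probs, reward=REWARD, gamma=GAMMA):
    """E[(Q(s, a) - r - gamma * V_Q(s'))^2] with a single sampled s'."""
    return sum(p * (q - (reward + gamma * v)) ** 2 for p, v in zip(probs, values))


# Choose Q(s, a) to satisfy the Bellman equation exactly.
q_value = bellman_backup(NEXT_VALUES, NEXT_PROBS)

bellman_error = true_squared_bellman_error(q_value, NEXT_VALUES, NEXT_PROBS)
naive_loss = expected_naive_squared_td(q_value, NEXT_VALUES, NEXT_PROBS)

# The gap is gamma^2 * Var[V_Q(s')], the bias introduced by single sampling.
variance_bias = naive_loss - bellman_error

print(f"true squared Bellman error: {bellman_error}")
print(f"expected naive squared TD loss: {naive_loss}")
print(f"variance bias: {variance_bias}")
```

In this toy case the naive loss penalizes a $Q$ that is exactly correct. Removing the bias with squared losses would require two independent next-state samples from the same state-action pair, which batch data generally does not provide.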

$Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison