Offline or batch reinforcement learning seeks to learn a near-optimal policy
using historical data without active exploration of the environment. To counter
the insufficient coverage and sample scarcity of many offline datasets, the
principle of pessimism has recently been introduced to mitigate the high bias of
the estimated values. While pessimistic variants of model-based algorithms
(e.g., value iteration with lower confidence bounds) have been theoretically
investigated, their model-free counterparts -- which do not require explicit
model estimation -- have not been adequately studied, especially in terms of
sample efficiency. To address this inadequacy, we study a pessimistic variant
of Q-learning in the context of finite-horizon Markov decision processes, and
characterize its sample complexity under the single-policy concentrability
assumption, which does not require full coverage of the state-action space.
In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to
achieve near-optimal sample complexity. Altogether, this work highlights the
efficiency of model-free algorithms in offline RL when used in conjunction with
pessimism and variance reduction.

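In summary, this paper studies a pessimistic variant of Q-learning for offline
reinforcement learning in finite-horizon Markov decision processes. It
characterizes the algorithm's sample complexity under the single-policy
concentrability assumption and proposes a variance-reduced pessimistic
Q-learning algorithm that achieves near-optimal sample complexity. The results
indicate that model-free algorithms can be efficient in offline RL when
combined with pessimism and variance reduction.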
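
To make the idea of pessimistic Q-learning concrete, the following is a minimal
tabular sketch for a finite-horizon MDP, not the exact algorithm analyzed in
this work. The abstract does not specify the update rule, so the function name
`pessimistic_q_learning`, the learning rate `(H + 1) / (H + n)`, the
Hoeffding-style penalty `c_b * sqrt(H**3 * iota**2 / n)`, and the parameters
`c_b` and `delta` are all assumptions chosen for illustration. Variance
reduction is not included.

```python
import math


def pessimistic_q_learning(dataset, S, A, H, c_b=1.0, delta=0.1):
    """Illustrative pessimistic (lower-confidence-bound) Q-learning.

    dataset: list of trajectories; each trajectory is a list of H tuples
             (s, a, r, s_next) with integer states/actions and r in [0, 1].
    S, A, H: number of states, number of actions, horizon length.
    c_b, delta: hypothetical penalty constant and failure probability.
    """
    T = max(len(dataset) * H, 1)
    iota = math.log(S * A * T / delta)  # assumed logarithmic factor

    # Q[h][s][a], V[h][s]; row H of V stays zero (no reward after the horizon).
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    V = [[0.0] * S for _ in range(H + 1)]
    N = [[[0] * A for _ in range(S)] for _ in range(H)]  # visit counts
    policy = [[0] * S for _ in range(H)]

    for trajectory in dataset:
        for h, (s, a, r, s_next) in enumerate(trajectory):
            N[h][s][a] += 1
            n = N[h][s][a]
            eta = (H + 1) / (H + n)  # assumed learning-rate schedule
            penalty = c_b * math.sqrt(H ** 3 * iota ** 2 / n)  # assumed form

            # Pessimism: subtract the penalty from the usual Q-learning target.
            target = r + V[h + 1][s_next] - penalty
            Q[h][s][a] = (1 - eta) * Q[h][s][a] + eta * target

            # Keep the value estimate monotone; update the policy only when
            # the greedy Q-value matches or improves the current estimate.
            best = max(range(A), key=lambda b: Q[h][s][b])
            if Q[h][s][best] >= V[h][s]:
                V[h][s] = Q[h][s][best]
                policy[h][s] = best

    return Q, V, policy
```

The penalty shrinks as the visit count `n` grows, so state-action pairs that
are well covered by the dataset are penalized less than rarely observed ones.
Setting `c_b=0` recovers plain Q-learning on the logged data, which is a
convenient baseline for comparison.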