We offer a theoretical characterization of off-policy evaluation (OPE) in
reinforcement learning using function approximation for marginal importance
weights and $q$-functions when these are estimated using recent minimax
methods. Under various combinations of realizability and completeness
assumptions, we show that the minimax approach enables us to achieve a fast
rate of convergence for weights and quality functions, characterized by the
critical inequality \citep{bartlett2005}. Based on this result, we analyze
convergence rates for OPE. In particular, we introduce novel alternative
completeness conditions under which OPE is feasible and we present the first
finite-sample result with first-order efficiency in non-tabular environments,
i.e., having the minimal coefficient in the leading term.

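This paper gives a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning with function approximation for marginal importance weights and $q$-functions estimated by recent minimax methods. Building on this result, it analyzes convergence rates for OPE, introduces new completeness conditions, and presents the first finite-sample result with first-order efficiency in non-tabular environments.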

Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency

We study minimax methods for off-policy evaluation (OPE) using value
functions and marginalized importance weights. Although they hold the promise
of overcoming the exponential variance of traditional importance sampling,
several key problems remain:
(1) They require function approximation and are generally biased. For the
sake of trustworthy OPE, is there any way to quantify the biases?
(2) They are split into two styles ("weight-learning" vs "value-learning").
Can we unify them?
In this paper we answer both questions positively. By slightly altering the
derivation of previous methods (one from each style; Uehara et al., 2020), we
unify them into a single value interval that comes with a special type of
double robustness: when either the value-function or the importance-weight
class is well specified, the interval is valid and its length quantifies the
misspecification of the other class. Our interval also provides a unified view
of, and new insights into, some recent methods, and we further explore the
implications of our results for exploration and exploitation in off-policy
policy optimization with insufficient data coverage.
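
The double robustness described above can be illustrated numerically. The
sketch below is not taken from the paper: it assumes a small random tabular
MDP, a normalized discounted return J, and a Lagrangian of the form
L(w, q) = (1 - gamma) E_{d0}[q(s0, pi)] + E_mu[w(s,a) (r + gamma q(s', pi) - q(s,a))],
which is a common form in the minimax OPE literature. All names (`P`, `R`,
`mu`, `lagrangian`, and so on) are hypothetical. With exact expectations,
L(w, Q^pi) equals J for any weight function w, and L(w_pi, q) equals J for any
q, where w_pi is the true marginalized importance weight.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9  # assumed sizes and discount for the toy MDP

# Random tabular MDP (all quantities are illustrative assumptions).
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
R = rng.random((S, A))                                          # expected reward
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)    # target policy
d0 = rng.random(S); d0 /= d0.sum()                              # initial states
mu = rng.random((S, A)); mu /= mu.sum()                         # data distribution

# Transition matrix over state-action pairs under pi.
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
I = np.eye(S * A)

# True q-function: Q = R + gamma * P_pi Q.
q_pi = np.linalg.solve(I - gamma * P_pi, R.reshape(-1)).reshape(S, A)

# Normalized discounted occupancy and the true marginalized weight.
init = (d0[:, None] * pi).reshape(-1)
d_pi = (1 - gamma) * np.linalg.solve(I - gamma * P_pi.T, init).reshape(S, A)
w_pi = d_pi / mu

# Normalized value of the target policy.
J = (1 - gamma) * np.sum(d0[:, None] * pi * q_pi)

def lagrangian(w, q):
    """Population version of the assumed Lagrangian L(w, q)."""
    v = (pi * q).sum(axis=1)        # q(s, pi)
    td = R + gamma * (P @ v) - q    # expected Bellman residual at each (s, a)
    return (1 - gamma) * (d0 @ v) + np.sum(mu * w * td)

# Correct q with an arbitrary (wrong) weight, and correct weight with an
# arbitrary (wrong) q, both recover J up to floating-point error.
wrong_w = 5.0 * rng.random((S, A))
wrong_q = rng.normal(size=(S, A))
print(J, lagrangian(wrong_w, q_pi), lagrangian(w_pi, wrong_q))
```

Identities of this kind are the usual starting point for double-robustness
arguments: an error in one function is harmless as long as the other is
exact. With finite samples and restricted function classes the equalities
hold only approximately.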

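This work studies minimax methods for off-policy evaluation using value functions and marginalized importance weights. It unifies the two styles of method into a single value interval with a special type of double robustness, which quantifies the bias from function approximation, and it explores the implications for exploration and exploitation in off-policy policy optimization with insufficient data coverage.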