We study offline reinforcement learning in large infinite-horizon discounted
Markov Decision Processes (MDPs) when the reward and transition models are
linearly realizable under a known feature map. Starting from the classic
linear-program formulation of the optimal control problem in MDPs, we develop a
new algorithm that performs a form of gradient ascent in the space of feature
occupancies, defined as the expected feature vectors that can potentially be
generated by executing policies in the environment. We show that the resulting
simple algorithm satisfies strong computational and sample complexity
guarantees, achieved under the least restrictive data coverage assumptions
known in the literature. In particular, we show that the sample complexity of
our method scales optimally with the desired accuracy level and depends on a
weak notion of coverage that only requires the empirical feature covariance
matrix to cover a single direction in the feature space (as opposed to covering
a full subspace). Additionally, our method is easy to implement and requires no
prior knowledge of the coverage ratio (or even an upper bound on it). Together,
these properties make it the strongest known algorithm for this setting to date.
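
To make the notion of a feature occupancy concrete, the following minimal sketch computes the expected feature vector generated by a fixed policy in a small tabular MDP. It is an illustration only and not the algorithm described above: the helper name `feature_occupancy`, the randomly generated MDP, the state-action indexing `s * A + a`, and the normalization of occupancies by (1 - gamma) are all assumptions made for this example.

```python
import numpy as np

def feature_occupancy(P, Phi, policy, nu0, gamma):
    """Expected discounted feature vector of a policy (illustrative helper).

    P:      (S*A, S) transition matrix, row index s*A + a (assumed layout)
    Phi:    (S*A, d) feature map
    policy: (S, A) action probabilities per state
    nu0:    (S,) initial-state distribution
    """
    S, A = policy.shape
    # State-to-state kernel under the policy: P_pi[s, t] = sum_a pi(a|s) P(t|s,a)
    P_pi = np.einsum("sa,sat->st", policy, P.reshape(S, A, S))
    # Normalized discounted state occupancy: nu^T = (1-gamma) nu0^T (I - gamma P_pi)^{-1}
    nu = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, nu0)
    # State-action occupancy, then its image under the feature map
    mu = (nu[:, None] * policy).reshape(S * A)
    return Phi.T @ mu, mu

# Hypothetical random MDP with linear rewards r = Phi @ theta
rng = np.random.default_rng(0)
S, A, d, gamma = 5, 3, 4, 0.9
P = rng.dirichlet(np.ones(S), size=S * A)
Phi = rng.normal(size=(S * A, d))
theta = rng.normal(size=d)
policy = rng.dirichlet(np.ones(A), size=S)
nu0 = np.ones(S) / S

lam, mu = feature_occupancy(P, Phi, policy, nu0, gamma)

# Independent check: the discounted return computed from the value function
r = Phi @ theta
P_pi = np.einsum("sa,sat->st", policy, P.reshape(S, A, S))
r_pi = (policy * r.reshape(S, A)).sum(axis=1)
v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
ret = nu0 @ v
```

With linear rewards, the normalized return of the policy is the inner product of `lam` and `theta`, so in this sketch the policy's value depends on the policy only through its d-dimensional feature occupancy.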

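We study offline reinforcement learning in large infinite-horizon discounted Markov decision processes when the reward and transition models are linearly realizable under a known feature map. We propose a new algorithm that addresses this problem by performing a form of gradient ascent in the space of feature occupancies. We show that the algorithm enjoys strong computational and sample complexity guarantees under the least restrictive data coverage assumptions known in the literature. Moreover, our method is easy to implement and requires no prior knowledge of the coverage ratio (or even an upper bound on it), which makes it the strongest known algorithm to date.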