We revisit the problem of offline reinforcement learning with value function
realizability but without Bellman completeness. Previous work by Xie and Jiang
(2021) and Foster et al. (2022) left open the question of whether a bounded
concentrability coefficient along with trajectory-based offline data admits a
polynomial sample complexity. In this work, we provide a negative answer to
this question for the task of offline policy evaluation. In addition to
addressing this question, we provide a rather complete picture for offline
policy evaluation with only value function realizability. Our primary findings
are threefold: (1) the sample complexity of offline policy evaluation is
governed by the concentrability coefficient in an aggregated Markov Transition
Model jointly determined by the function class and the offline data
distribution, rather than that in the original MDP, which unifies and
generalizes the ideas of Xie and Jiang (2021) and Foster et al. (2022); (2) the
concentrability coefficient in the aggregated Markov Transition Model may grow
exponentially with the horizon length, even when the concentrability
coefficient in the original MDP is small and the offline data is admissible
(i.e., the data distribution equals the occupancy measure of some policy); and
(3) under value function realizability, there is a generic reduction that
converts any hard instance with admissible data into a hard instance with
trajectory data, implying that trajectory data offers no extra benefit over
admissible data. Together, these three results resolve the open problem, and
each may be of independent interest.
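
For reference, one common way to write the concentrability coefficient for evaluating a target policy is sketched here. The notation ($\pi$, $d_h^{\pi}$, $\mu_h$, $H$, $C_\pi$, $\pi_b$) is assumed for illustration and does not come from the abstract; the paper's own definition may differ in detail.

\[
C_\pi \;=\; \max_{h \in \{1,\dots,H\}} \; \sup_{(s,a)} \; \frac{d_h^{\pi}(s,a)}{\mu_h(s,a)}
\]

Here $d_h^{\pi}$ denotes the occupancy measure of the target policy $\pi$ at step $h$, and $\mu_h$ denotes the offline data distribution at step $h$. In this notation, the data is admissible when $\mu_h = d_h^{\pi_b}$ for some policy $\pi_b$. The first finding says that the quantity governing sample complexity is the coefficient computed in the aggregated model, not the one in the original MDP.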

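For offline reinforcement learning with value function realizability but without Bellman completeness, we give a negative answer for the task of offline policy evaluation and show that the concentrability coefficient in the aggregated Markov Transition Model governs the sample complexity: even when the concentrability coefficient in the original MDP is small and the offline data is admissible, the aggregated concentrability coefficient may still grow exponentially, and trajectory data offers no extra benefit over admissible data.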