We study the problem of agnostic PAC reinforcement learning (RL): given a
policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a
potentially large state and action space) are required to learn an
$\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we
introduce a new complexity measure, called the \emph{spanning capacity}, that
depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a
generative model, we show that for any policy class $\Pi$, bounded spanning
capacity characterizes PAC learnability. However, for online RL, the situation
is more subtle. We show there exists a policy class $\Pi$ with a bounded
spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for agnostic learnability between
generative access and online access models (as well as between
deterministic/stochastic MDPs under online access). On the positive side, we
identify an additional \emph{sunflower} structure, which in conjunction with
bounded spanning capacity enables statistically efficient online RL via a new
algorithm called POPLER, which takes inspiration from classical importance
sampling methods as well as techniques for reachable-state identification and
policy evaluation in reward-free exploration.
