Direct policy gradient methods for reinforcement learning are a successful
approach for a variety of reasons: they are model free, they directly optimize
the performance metric of interest, and they allow for richly parameterized
policies. Their primary drawback is that, by being local in nature, they fail
to adequately explore the environment. In contrast, while model-based
approaches and Q-learning directly handle exploration through the use of
optimism, their ability to handle model misspecification and function
approximation is far less evident. This work introduces the Policy
Cover-Policy Gradient (PC-PG) algorithm, which provably balances the
exploration vs. exploitation tradeoff using an ensemble of learned policies
(the policy cover). PC-PG enjoys polynomial sample complexity and run time for
both tabular MDPs and, more generally, linear MDPs in an infinite-dimensional
RKHS. Furthermore, PC-PG also has strong guarantees under model
misspecification that go beyond the standard worst-case $\ell_{\infty}$
assumptions; this includes approximation guarantees for state aggregation under
an average-case error assumption, along with guarantees under a more general
assumption where the approximation error under distribution shift is
controlled. We complement the theory with empirical evaluation across a variety
of domains in both reward-free and reward-driven settings.

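This paper introduces the Policy Cover-Policy Gradient (PC-PG) algorithm, which balances the exploration vs. exploitation tradeoff using a learned set of policies (the policy cover) and comes with strong guarantees under model misspecification.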