Collecting and leveraging data with good coverage properties plays a crucial
role in different aspects of reinforcement learning (RL), including reward-free
exploration and offline learning. However, the notion of "good coverage"
depends on the application at hand, as data suitable for one context may not be
so for another. In this paper, we formalize the problem of active coverage in
episodic Markov decision processes (MDPs), where the goal is to interact with
the environment so as to fulfill given sampling requirements. This framework is
sufficiently flexible to specify any desired coverage property, making it
applicable to any problem that involves online exploration. Our main
contribution is an instance-dependent lower bound on the sample complexity of
active coverage and a simple game-theoretic algorithm, CovGame, that nearly
matches it. We then show that CovGame can be used as a building block to solve
different PAC RL tasks. In particular, we obtain a simple algorithm for PAC
reward-free exploration with an instance-dependent sample complexity that, in
certain MDPs which are "easy to explore", is lower than the minimax one. By
further coupling this exploration algorithm with a new technique for implicit
elimination in policy space, we obtain a computationally efficient algorithm
for best-policy identification whose instance-dependent sample complexity
scales with gaps between policy values.

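This work proposes a flexible framework for the data-coverage problem in reinforcement learning, together with the CovGame algorithm, which nearly matches the sample-complexity lower bound and serves as a building block for solving different PAC RL tasks.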

Active Coverage for PAC Reinforcement Learning

Recent work on exploration in reinforcement learning (RL) has led to a series
of increasingly complex solutions to the problem. This increase in complexity
often comes at the expense of generality. Recent empirical studies suggest
that, when applied to a broader set of domains, some sophisticated exploration
methods are outperformed by simpler counterparts, such as ε-greedy. In
this paper we propose an exploration algorithm that retains the simplicity of
ε-greedy while reducing dithering. We build on a simple hypothesis:
the main limitation of ε-greedy exploration is its lack of temporal
persistence, which limits its ability to escape local optima. We propose a
temporally extended form of ε-greedy that simply repeats the sampled
action for a random duration. It turns out that, for many duration
distributions, this suffices to improve exploration on a large set of domains.
Interestingly, a class of distributions inspired by ecological models of animal
foraging behaviour yields particularly strong performance.
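
The action-repetition idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TemporallyExtendedEpsilonGreedy`, its parameters, and the choice of a truncated power-law duration distribution P(n) ∝ n^(-mu) (used here as a stand-in for the heavy-tailed, foraging-inspired class mentioned in the abstract) are all hypothetical.

```python
import random


class TemporallyExtendedEpsilonGreedy:
    """Epsilon-greedy that repeats an exploratory action for a random duration.

    Assumption: durations follow a truncated power law P(n) ~ n**(-mu);
    the abstract only says that many duration distributions work.
    """

    def __init__(self, num_actions, epsilon=0.1, mu=2.0, max_duration=100, seed=0):
        self.num_actions = num_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Support and unnormalized weights of the duration distribution.
        self.durations = list(range(1, max_duration + 1))
        self.weights = [n ** (-mu) for n in self.durations]
        self.remaining = 0           # repeats still owed for the current action
        self.repeated_action = None  # exploratory action being repeated

    def act(self, q_values):
        # Keep repeating the previously sampled exploratory action.
        if self.remaining > 0:
            self.remaining -= 1
            return self.repeated_action
        # With probability epsilon, start a new exploratory "option".
        if self.rng.random() < self.epsilon:
            self.repeated_action = self.rng.randrange(self.num_actions)
            duration = self.rng.choices(self.durations, weights=self.weights)[0]
            self.remaining = duration - 1  # this call uses one step
            return self.repeated_action
        # Otherwise act greedily with respect to the current value estimates.
        return max(range(self.num_actions), key=lambda a: q_values[a])
```

Setting `max_duration=1` recovers ordinary ε-greedy, which makes it easy to compare the two behaviours on the same agent.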

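This paper proposes a temporally extended ε-greedy exploration algorithm that improves exploration by repeating a randomly sampled action for a random duration, and it performs well across many different domains.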

Temporally-Extended ε-Greedy Exploration

We present a new algorithm that significantly improves the efficiency of
exploration for deep Q-learning agents in dialogue systems. Our agents explore
via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop
neural network. Our algorithm learns much faster than common exploration
strategies such as ε-greedy, Boltzmann, bootstrapping, and
intrinsic-reward-based ones. Additionally, we show that spiking the replay
buffer with experiences from just a few successful episodes can make Q-learning
feasible when it might otherwise fail.
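
The two ideas in this abstract, Thompson sampling from a weight posterior and spiking the replay buffer, can be illustrated with a small sketch. It rests on several assumptions that are not in the source: a linear Q-function stands in for the Bayes-by-Backprop network, the posterior is a factorized Gaussian parameterized by a mean `mu` and a value `rho` with standard deviation `log(1 + exp(rho))`, and the function names `thompson_action` and `spike_replay_buffer` are hypothetical.

```python
import math
import random
from collections import deque


def sample_weights(mu, rho, rng):
    """Draw one Monte Carlo sample from a factorized Gaussian posterior.

    mu[a][i] and rho[a][i] parameterize the weight linking state feature i
    to action a; the standard deviation is softplus(rho).
    """
    return [
        [m + math.log1p(math.exp(r)) * rng.gauss(0.0, 1.0)
         for m, r in zip(mu_row, rho_row)]
        for mu_row, rho_row in zip(mu, rho)
    ]


def thompson_action(state, mu, rho, rng):
    """Thompson sampling: sample weights once, then act greedily under them."""
    weights = sample_weights(mu, rho, rng)
    q_values = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return max(range(len(q_values)), key=q_values.__getitem__)


def spike_replay_buffer(buffer, successful_episodes):
    """Pre-fill the replay buffer with transitions from successful episodes."""
    for episode in successful_episodes:
        buffer.extend(episode)
    return buffer


# Usage: a two-feature, two-action toy problem with a nearly certain posterior.
rng = random.Random(0)
mu = [[1.0, 0.0], [0.0, 1.0]]
rho = [[-50.0, -50.0], [-50.0, -50.0]]  # softplus(-50) is effectively zero
action = thompson_action([0.0, 1.0], mu, rho, rng)

# Transitions are (state, action, reward, next_state, done) tuples.
buffer = deque(maxlen=1000)
demo_episode = [([0.0, 1.0], 1, 1.0, [1.0, 0.0], True)]
spike_replay_buffer(buffer, [demo_episode])
```

Because exploration comes from posterior uncertainty rather than injected noise, the agent explores less as `rho` shrinks during training; the spiked transitions give Q-learning some rewarding experience to learn from early on.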

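This work proposes a new exploration algorithm, based on a Bayes-by-Backprop neural network and replay-buffer spiking, that substantially improves the efficiency of deep Q-learning in dialogue systems and learns faster than common exploration strategies.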