Learning high-quality Q-value functions plays a key role in the success of
many modern off-policy deep reinforcement learning (RL) algorithms. Previous
works focus on addressing the value overestimation issue, an outcome of
adopting function approximators and off-policy learning. Deviating from the
common viewpoint, we observe that Q-values are in fact underestimated in the
later stages of the RL training process, primarily related to the use of
inferior actions from the current policy in Bellman updates as compared to the
better action samples in the replay buffer. We hypothesize that this
long-neglected phenomenon potentially hinders policy learning and reduces
sample efficiency. Our insight to address this issue is to incorporate
sufficient exploitation of past successes while maintaining exploration
optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a
simple yet effective approach that updates Q-values using both historical
best-performing actions and the current policy. The instantiations of our
method in both model-free and model-based settings outperform state-of-the-art
methods in various continuous control tasks and achieve strong performance in
failure-prone scenarios and real-world robot tasks.
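
The blending idea described above can be illustrated with a small tabular sketch. It is not the authors' implementation: the tabular Q-table, the fixed blending weight `lam`, and the names `blended_target`, `buffer_actions`, and `policy_probs` are assumptions made for illustration. The target mixes the best value among actions already seen in the replay buffer (exploitation) with the expected value under the current policy (exploration).

```python
import numpy as np

def blended_target(q, reward, next_state, buffer_actions, policy_probs,
                   gamma=0.99, lam=0.5):
    """Hypothetical blended Bellman target for a tabular Q-table.

    q              : array of shape [n_states, n_actions]
    buffer_actions : actions observed in the replay buffer at next_state
    policy_probs   : current policy's action distribution at next_state
    lam            : assumed blending weight between the two terms
    """
    # Exploitation term: best Q-value among actions already in the buffer.
    exploit = max(q[next_state, a] for a in buffer_actions)
    # Exploration term: expected Q-value under the current policy.
    explore = float(np.dot(policy_probs, q[next_state]))
    return reward + gamma * (lam * exploit + (1.0 - lam) * explore)

q_table = np.array([[0.0, 0.0],
                    [1.0, 3.0]])
target = blended_target(q_table, reward=1.0, next_state=1,
                        buffer_actions=[1],
                        policy_probs=np.array([1.0, 0.0]),
                        gamma=0.5, lam=0.5)
print(target)
```

With `lam = 0` the sketch reduces to a standard on-policy expectation target, and with `lam = 1` it relies only on actions stored in the buffer.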

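Proposes the Blended Exploitation and Exploration (BEE) operator to address Q-value underestimation in the later stages of reinforcement learning, offering high sample efficiency and practical applicability.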

Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic

An inherent problem in reinforcement learning is coping with policies that
are uncertain about what action to take (or the value of a state). Model
uncertainty, more formally known as epistemic uncertainty, refers to the
expected prediction error of a model beyond the sampling noise. In this paper,
we propose a metric for epistemic uncertainty estimation in Q-value functions,
which we term pathwise epistemic uncertainty. We further develop a method to
compute its approximate upper bound, which we call the F-value. We experimentally
apply the latter to Deep Q-Networks (DQN) and show that uncertainty estimation
in reinforcement learning serves as a useful indication of learning progress.
We then propose a new approach to improving sample efficiency in actor-critic
algorithms by learning from an existing (previously learned or hard-coded)
oracle policy while uncertainty is high, aiming to avoid unproductive random
actions during training. We term this Critic Confidence Guided Exploration
(CCGE). We implement CCGE on Soft Actor-Critic (SAC) using our F-value metric,
apply it to a handful of popular Gym environments, and show that it achieves
better sample efficiency and total episodic reward than vanilla SAC in
limited contexts.
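
The guidance idea described above can be sketched as a simple gate on an uncertainty estimate. This is only an illustration under stated assumptions, not the CCGE algorithm itself: the function `select_action`, the scalar `threshold`, and the callable `uncertainty` (standing in for an F-value style estimate) are all hypothetical.

```python
def select_action(state, policy, oracle, uncertainty, threshold=0.5):
    """Return (action, used_oracle).

    policy, oracle : callables mapping a state to an action
    uncertainty    : callable mapping a state to a scalar estimate
                     (a stand-in for an epistemic uncertainty metric)
    threshold      : assumed cutoff above which the oracle is followed
    """
    # While the critic is uncertain, defer to the oracle policy.
    if uncertainty(state) > threshold:
        return oracle(state), True
    # Otherwise act with the learned policy.
    return policy(state), False

learned = lambda s: "learned"
oracle = lambda s: "oracle"
# Toy uncertainty: high in state 0, low elsewhere.
toy_uncertainty = lambda s: 0.9 if s == 0 else 0.1

print(select_action(0, learned, oracle, toy_uncertainty))
print(select_action(1, learned, oracle, toy_uncertainty))
```

A hard threshold is the simplest possible gate; a probabilistic or smoothed gate would be an equally plausible design under the same description.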

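This paper proposes a metric for epistemic uncertainty in Q-value functions, called pathwise epistemic uncertainty, and develops a method for computing its approximate upper bound, the F-value. Applying it experimentally to Deep Q-Networks (DQN) shows that uncertainty estimation in reinforcement learning is a useful indicator of learning progress. The paper also proposes Critic Confidence Guided Exploration (CCGE), which learns from an existing (previously learned or hard-coded) oracle policy while uncertainty is high, in order to avoid unproductive random actions during training. Applied to Soft Actor-Critic (SAC), the method performs better than vanilla SAC on several common Gym environments.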

Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics

We consider a Reinforcement Learning setup where an agent interacts with an
environment in observation-reward-action cycles without any (esp. MDP)
assumptions on the environment. State aggregation and, more generally, feature
reinforcement learning are concerned with mapping histories/raw-states to
reduced/aggregated states. The idea behind both is that the resulting reduced
process (approximately) forms a small stationary finite-state MDP, which can
then be efficiently solved or learnt. We considerably generalize existing
aggregation results by showing that even if the reduced process is not an MDP,
the (q-)value functions and (optimal) policies of an associated MDP with the same
state-space size solve the original problem, as long as the solution can
approximately be represented as a function of the reduced states. This implies
an upper bound on the required state space size that holds uniformly for all RL
problems. It may also explain why RL algorithms designed for MDPs sometimes
perform well beyond MDPs.
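
As a rough illustration of the aggregation idea above, the following sketch maps raw states to reduced states with a feature map, estimates an associated finite MDP from experience by counting, and solves it by Q-value iteration. It is a minimal sketch and not the paper's construction: the map `phi`, the count-based estimates, and the function names are assumptions.

```python
import numpy as np

def estimate_reduced_mdp(transitions, phi, n_reduced, n_actions):
    """Estimate transition probabilities P[s, a, s'] and mean rewards R[s, a]
    of a reduced process from (raw_state, action, reward, raw_next) tuples."""
    counts = np.zeros((n_reduced, n_actions, n_reduced))
    reward_sum = np.zeros((n_reduced, n_actions))
    for raw, action, reward, raw_next in transitions:
        s, s_next = phi(raw), phi(raw_next)
        counts[s, action, s_next] += 1
        reward_sum[s, action] += reward
    # Avoid division by zero for unvisited (state, action) pairs.
    visits = np.maximum(counts.sum(axis=2), 1)
    return counts / visits[:, :, None], reward_sum / visits

def q_iteration(P, R, gamma=0.9, iters=200):
    """Plain Q-value iteration on the estimated reduced MDP."""
    q = np.zeros_like(R)
    for _ in range(iters):
        q = R + gamma * (P @ q.max(axis=1))
    return q

# Four raw states aggregated into two reduced states (hypothetical map).
phi = lambda raw: raw // 2
data = [(0, 0, 1.0, 2), (1, 0, 1.0, 3), (2, 0, 0.0, 2), (3, 0, 0.0, 3)]
P, R = estimate_reduced_mdp(data, phi, n_reduced=2, n_actions=1)
q = q_iteration(P, R)
print(q)
```

The resulting Q-values are a function of the reduced state only, so a policy derived from them can be applied to raw states through `phi`.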

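Studies state aggregation and feature learning in reinforcement learning. By relating the aggregated process to an associated Markov decision process, it generalizes existing aggregation results and obtains an upper bound on the required state-space size.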

Extreme State Aggregation Beyond MDPs

Decision-theoretic planning is a popular approach to sequential
decision-making problems, because it treats uncertainty in sensing and acting in a
principled way. In single-agent frameworks like MDPs and POMDPs, planning can
be carried out by resorting to Q-value functions: an optimal Q-value function
Q* is computed in a recursive manner by dynamic programming, and then an
optimal policy is extracted from Q*. In this paper we study whether similar
Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs),
and how policies can be extracted from such value functions. We define two
forms of the optimal Q-value function for Dec-POMDPs: one that gives a
normative description as the Q-value function of an optimal pure joint policy
and another one that is sequentially rational and thus gives a recipe for
computation. This computation, however, is infeasible for all but the smallest
problems. Therefore, we analyze various approximate Q-value functions that
allow for efficient computation. We describe how they relate, and we prove that
they all provide an upper bound to the optimal Q-value function Q*. Finally,
unifying some previous approaches for solving Dec-POMDPs, we describe a family
of algorithms for extracting policies from such Q-value functions, and perform
an experimental evaluation on existing test problems, including a new
firefighting benchmark problem.
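
For the single-agent case mentioned above, the recursive computation of Q* and the extraction of a policy from it can be sketched for a small finite-horizon MDP. This covers only the fully observable single-agent setting, not the Dec-POMDP value functions studied in the paper; the array layout and function names are assumptions.

```python
import numpy as np

def finite_horizon_q(P, R, horizon):
    """Backward dynamic programming for an undiscounted finite-horizon MDP.

    P : transition probabilities, shape [S, A, S]
    R : immediate rewards, shape [S, A]
    Returns a list [Q_0, ..., Q_{horizon-1}] indexed by time step.
    """
    v_next = np.zeros(R.shape[0])   # value after the final step is zero
    qs = []
    for _ in range(horizon):
        q = R + P @ v_next          # Bellman backup for one stage
        qs.append(q)
        v_next = q.max(axis=1)      # optimal value at this stage
    qs.reverse()                    # computed last-to-first, return first-to-last
    return qs

def greedy_policy(q):
    """Extract the action that maximizes Q in each state."""
    return q.argmax(axis=1)

# Two states, two actions: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
# Only staying in state 1 is rewarded.
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

qs = finite_horizon_q(P, R, horizon=2)
first_step_policy = greedy_policy(qs[0])
print(first_step_policy)
```

In this toy problem the greedy policy at the first step moves from state 0 to state 1 and stays in state 1.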

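This paper studies the application of decision-theoretic planning in single-agent and decentralized POMDP models, proposes a feasible computational approach, and evaluates the resulting algorithms.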