Non-parametric episodic memory can be used to quickly latch onto high-reward
experience in reinforcement learning tasks. In contrast to parametric deep
reinforcement learning approaches, these methods only need to discover the
solution once, and may then repeatedly solve the task. However, episodic
control solutions are stored in discrete tables, and this approach has so far
only been applied to discrete action space problems. Therefore, this paper
introduces Continuous Episodic Control (CEC), a novel non-parametric episodic
memory algorithm for sequential decision making in problems with a continuous
action space. Results on several sparse-reward continuous control environments
show that our proposed method learns faster than state-of-the-art model-free RL
and memory-augmented RL algorithms, while maintaining good long-run
performance. In short, CEC can be a fast approach for learning in continuous
control tasks, and a useful addition to parametric RL methods in a hybrid
approach.
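
To make the idea of a non-parametric episodic memory over continuous actions concrete, here is a minimal Python sketch. It is not the CEC algorithm itself, whose details the abstract does not give. The nearest-neighbour lookup, the `radius` parameter, and the names `EpisodicMemory` and `act` are all assumptions made for illustration.

```python
import math
import random


class EpisodicMemory:
    """Toy non-parametric memory of (state, action, return) entries."""

    def __init__(self, radius=0.5):
        self.radius = radius  # hypothetical neighbourhood radius (assumption)
        self.entries = []     # list of (state, action, episodic return)

    def write(self, state, action, ret):
        """Keep only the highest-return action seen near each stored state."""
        for i, (s, _, r) in enumerate(self.entries):
            if math.dist(s, state) <= self.radius:
                if ret > r:
                    self.entries[i] = (s, tuple(action), ret)
                return
        self.entries.append((tuple(state), tuple(action), ret))

    def read(self, state):
        """Return the action of the nearest entry within radius, else None."""
        best, best_d = None, self.radius
        for s, a, _ in self.entries:
            d = math.dist(s, state)
            if d <= best_d:
                best, best_d = a, d
        return best


def act(memory, state, action_dim, rng, noise=0.0):
    """Replay a remembered action (plus optional noise), else explore."""
    stored = memory.read(state)
    if stored is None:
        return tuple(rng.uniform(-1.0, 1.0) for _ in range(action_dim))
    return tuple(min(1.0, max(-1.0, a + rng.gauss(0.0, noise))) for a in stored)
```

Because the table is overwritten as soon as a higher return is observed, a single successful episode is enough for the stored action to be replayed. This is the "discover the solution once" property the abstract contrasts with parametric methods.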

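This paper proposes CEC, a novel non-parametric episodic memory algorithm for sequential decision-making problems with continuous action spaces. On several sparse-reward continuous control environments, it learns faster than state-of-the-art model-free RL and memory-augmented RL algorithms while maintaining good long-run performance.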

Continuous Episodic Control

A fascinating aspect of nature lies in its ability to produce a large and
diverse collection of organisms that are all high-performing in their niche. By
contrast, most AI algorithms focus on finding a single efficient solution to a
given problem. Aiming for diversity in addition to performance is a convenient
way to deal with the exploration-exploitation trade-off that plays a central
role in learning. It also allows for increased robustness when the returned
collection contains several working solutions to the considered problem, making
it well-suited for real applications such as robotics. Quality-Diversity (QD)
methods are evolutionary algorithms designed for this purpose. This paper
proposes a novel algorithm, QDPG, which combines the strengths of Policy
Gradient algorithms and Quality Diversity approaches to produce a collection of
diverse and high-performing neural policies in continuous control environments.
The main contribution of this work is the introduction of a Diversity Policy
Gradient (DPG) that exploits information at the time-step level to drive
policies towards more diversity in a sample-efficient manner. Specifically,
QDPG selects neural controllers from a MAP-Elites grid and uses two
gradient-based mutation operators to improve both quality and diversity. Our
results demonstrate that QDPG is significantly more sample-efficient than its
evolutionary competitors.
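
The selection-and-mutation loop described above can be sketched on a toy problem. This is a simplified illustration, not QDPG. The neural policies are replaced by 2-D parameter vectors, and the policy-gradient operators are swapped for analytic gradient steps on a toy fitness and a crude novelty measure. The grid size, fitness, descriptor, and all function names are assumptions.

```python
import random

GRID = 5  # hypothetical number of cells per descriptor dimension


def clip01(theta):
    return tuple(min(1.0, max(0.0, t)) for t in theta)


def fitness(theta):
    # Toy quality: solutions closer to (0.5, 0.5) score higher.
    return -sum((t - 0.5) ** 2 for t in theta)


def descriptor(theta):
    # Toy behaviour descriptor: the clipped parameters themselves.
    return clip01(theta)


def cell_of(desc):
    return tuple(min(GRID - 1, int(d * GRID)) for d in desc)


def quality_step(theta, lr=0.1):
    # Gradient ascent on the toy fitness (stand-in for a quality gradient).
    return tuple(t - lr * 2.0 * (t - 0.5) for t in theta)


def diversity_step(theta, archive, lr=0.5):
    # Move away from the mean elite descriptor (stand-in for a diversity gradient).
    descs = [descriptor(elite) for elite, _ in archive.values()]
    mean = [sum(d[i] for d in descs) / len(descs) for i in range(len(theta))]
    return tuple(t + lr * (t - m) for t, m in zip(theta, mean))


def try_insert(archive, theta):
    # A cell keeps only the best solution whose descriptor falls into it.
    theta = clip01(theta)
    cell, fit = cell_of(descriptor(theta)), fitness(theta)
    if cell not in archive or fit > archive[cell][1]:
        archive[cell] = (theta, fit)


def map_elites(iterations=500, seed=0):
    rng = random.Random(seed)
    archive = {}
    for _ in range(10):  # random initial population
        try_insert(archive, (rng.random(), rng.random()))
    for it in range(iterations):
        parent, _ = rng.choice(list(archive.values()))
        # Alternate between the two mutation operators.
        child = quality_step(parent) if it % 2 == 0 else diversity_step(parent, archive)
        child = tuple(c + rng.gauss(0.0, 0.02) for c in child)  # small exploration noise
        try_insert(archive, child)
    return archive
```

Calling `map_elites()` returns a dictionary mapping grid cells to `(parameters, fitness)` pairs. The number of filled cells gives a rough measure of diversity, and the stored fitness values give a measure of quality.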

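This paper proposes QDPG, a new algorithm that combines policy gradient algorithms with Quality-Diversity methods to produce diverse, high-performing neural controllers in continuous control environments, and that is more sample-efficient than its evolutionary competitors.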

Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization

Shared autonomy provides an effective framework for human-robot collaboration
that takes advantage of the complementary strengths of humans and robots to
achieve common goals. Many existing approaches to shared autonomy make
restrictive assumptions that the goal space, environment dynamics, or human
policy are known a priori, or are limited to discrete action spaces, preventing
those methods from scaling to complicated real-world environments. We propose a
model-free, residual policy learning algorithm for shared autonomy that
alleviates the need for these assumptions. Our agents are trained to minimally
adjust the human's actions such that a set of goal-agnostic constraints is
satisfied. We test our method in two continuous control environments: Lunar
Lander, a 2D flight control domain, and a 6-DOF quadrotor reaching task. In
experiments with human and surrogate pilots, our method significantly improves
task performance without any knowledge of the human's goal beyond the
constraints. These results highlight the ability of model-free deep
reinforcement learning to realize assistive agents suited to continuous control
settings with little knowledge of user intent.
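
The notion of "minimally adjusting" the human's action can be illustrated with a short Python sketch of the action composition. Only the residual structure comes from the abstract. The clipping bounds, the quadratic penalty, the `penalty` weight, and the function names are assumptions for illustration. The agent that outputs the residual would be trained with model-free deep RL, which is not shown here.

```python
def shared_action(human_action, residual, low=-1.0, high=1.0):
    """Add the agent's residual correction to the human's action, then clip."""
    return tuple(min(high, max(low, h + r)) for h, r in zip(human_action, residual))


def residual_objective(constraint_reward, residual, penalty=0.1):
    """Hypothetical per-step objective: reward constraint satisfaction,
    penalise large corrections so the human's input is changed as little
    as possible."""
    return constraint_reward - penalty * sum(r * r for r in residual)
```

With a zero residual the human's action passes through unchanged, so the agent only intervenes when doing so is worth the penalty.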

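Proposes a model-free residual policy learning algorithm for shared autonomy, which combines the complementary strengths of humans and robots to achieve common goals. Tests in two continuous control environments, Lunar Lander and a 6-DOF quadrotor reaching task, show that the method significantly improves task performance.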