A central aim of artificial intelligence (AI) is to design autonomous agents
capable of achieving complex tasks. In particular, reinforcement learning (RL)
provides a theoretical framework for learning optimal behaviors. In practice, RL
algorithms rely on geometric discounts to evaluate this optimality.
Unfortunately, this does not cover decision processes where future returns are
not exponentially less valuable. Depending on the problem, this limitation
induces sample inefficiency (as feedback is exponentially decayed) and
requires additional curricula/exploration mechanisms (to deal with sparse,
deceptive or adversarial rewards). In this paper, we tackle these issues by
generalizing the discounted problem formulation with a family of delayed
objective functions. We investigate the underlying RL problem to derive: 1) the
optimal stationary solution and 2) an approximation of the optimal
non-stationary control. The devised algorithms solved hard exploration problems
in tabular environments and improved sample efficiency on classic simulated
robotics benchmarks.
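
As an illustration of the limitation described above, the following minimal sketch computes a standard geometrically discounted return. It is not the delayed criterion proposed in the paper, whose exact form is not given in this abstract; the function name `discounted_return` and the example values are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence.

    Illustrative helper (hypothetical name): the standard geometric
    criterion, evaluated with a backward recursion G_t = r_t + gamma * G_{t+1}.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # A single sparse reward of 1.0 received after 100 empty steps.
    sparse_rewards = [0.0] * 100 + [1.0]
    # With gamma = 0.9 the reward is weighted by 0.9**100, so it
    # contributes almost nothing to the return seen from the first step.
    print(discounted_return(sparse_rewards, gamma=0.9))
```

With a discount of 0.9, the reward at step 100 is scaled by 0.9**100, which is about 2.7e-5. This is the sense in which late feedback is exponentially decayed under a geometric discount.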

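By generalizing the discounted problem formulation with a family of delayed objective functions, this work addresses the sample inefficiency and exploration problems found in reinforcement learning. The devised algorithms solved hard exploration problems and improved sample efficiency on classic simulated robotics benchmarks.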

Delayed Geometric Discounts: An Alternative Criterion for Reinforcement Learning

Operating in the real world often requires agents to learn about a complex
environment and apply this understanding to achieve a breadth of goals. This
problem, known as goal-conditioned reinforcement learning (GCRL), becomes
especially challenging for long-horizon goals. Current methods have tackled
this problem by augmenting goal-conditioned policies with graph-based planning
algorithms. However, they struggle to scale to large, high-dimensional state
spaces and assume access to exploration mechanisms for efficiently collecting
training data. In this work, we introduce Successor Feature Landmarks (SFL), a
framework for exploring large, high-dimensional environments so as to obtain a
policy that is proficient for any goal. SFL leverages the ability of successor
features (SF) to capture transition dynamics, using it to drive exploration by
estimating state-novelty and to enable high-level planning by abstracting the
state-space as a non-parametric landmark-based graph. We further exploit SF to
directly compute a goal-conditioned policy for inter-landmark traversal, which
we use to execute plans to "frontier" landmarks at the edge of the explored
state space. We show in our experiments on MiniGrid and ViZDoom that SFL
enables efficient exploration of large, high-dimensional state spaces and
outperforms state-of-the-art baselines on long-horizon GCRL tasks.
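
For readers unfamiliar with successor features, the sketch below computes them for a tiny tabular Markov chain under a fixed policy, using the standard recursion psi(s) = phi(s) + gamma * sum_s' P(s'|s) psi(s'). It is only an illustration of the general SF definition, not the SFL method itself; the function name `successor_features`, the three-state chain, and the one-hot features are assumptions made for the example.

```python
def successor_features(transitions, features, gamma, n_iters=200):
    """Fixed-point iteration for tabular successor features.

    transitions[s][s2] is the probability of moving from s to s2 under a
    fixed policy, and features[s] is the feature vector phi(s).
    Hypothetical helper written for illustration only.
    """
    n_states = len(transitions)
    dim = len(features[0])
    psi = [list(row) for row in features]
    for _ in range(n_iters):
        new_psi = []
        for s in range(n_states):
            # Expected successor features of the next state.
            expected_next = [
                sum(transitions[s][s2] * psi[s2][k] for s2 in range(n_states))
                for k in range(dim)
            ]
            new_psi.append(
                [features[s][k] + gamma * expected_next[k] for k in range(dim)]
            )
        psi = new_psi
    return psi


if __name__ == "__main__":
    # Deterministic chain 0 -> 1 -> 2, where state 2 is absorbing.
    chain = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
    one_hot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    for row in successor_features(chain, one_hot, gamma=0.5):
        print(row)
```

With one-hot features, each row holds the discounted expected visitation of every state from the corresponding start state, which is how successor features encode transition dynamics.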

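This paper introduces Successor Feature Landmarks (SFL), a framework for exploring large, high-dimensional spaces. The framework uses successor features (SF) to drive exploration by estimating state novelty and to enable high-level planning by abstracting the state space as a non-parametric landmark-based graph, and it outperforms baselines on GCRL tasks.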

Successor Feature Landmarks for Long-Horizon Goal-Conditioned Reinforcement Learning

We study neural-linear bandits for solving problems where both
exploration and representation learning play an important role. Neural-linear
bandits harness the representation power of Deep Neural Networks (DNNs) and
combine it with efficient exploration mechanisms designed for linear contextual
bandits, which leverage the model's uncertainty estimates and operate on top of
the last hidden layer. To mitigate the problem of representation change
during the process, new uncertainty estimations are computed using stored data
from an unlimited buffer. Nevertheless, when the amount of stored data is
limited, a phenomenon called catastrophic forgetting emerges. To alleviate
this, we propose a likelihood matching algorithm that is resilient to
catastrophic forgetting and is completely online. We applied our algorithm,
Limited Memory Neural-Linear with Likelihood Matching (NeuralLinear-LiM2), to a
variety of datasets and observed that it achieves performance comparable to
the unlimited-memory approach while exhibiting resilience to
catastrophic forgetting.
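
To make the idea of a linear bandit head on top of the last hidden layer concrete, the sketch below maintains a ridge-regression posterior over fixed feature vectors and reports a per-input uncertainty. It is a generic linear contextual bandit component, not NeuralLinear-LiM2 and not its likelihood matching step; the class name `LinearHead`, the `ridge` parameter, and the assumption that the feature vectors come from a network's last hidden layer are illustrative only.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _matvec(matrix, vector):
    return [_dot(row, vector) for row in matrix]


class LinearHead:
    """Online ridge regression over feature vectors (hypothetical class).

    Keeps the inverse of A = ridge * I + sum_i x_i x_i^T and the vector
    b = sum_i r_i x_i, where x_i are features assumed to come from the
    last hidden layer of a network and r_i are observed rewards.
    """

    def __init__(self, dim, ridge=1.0):
        self.a_inv = [
            [1.0 / ridge if i == j else 0.0 for j in range(dim)]
            for i in range(dim)
        ]
        self.b = [0.0] * dim

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of the inverse:
        # (A + x x^T)^-1 = A^-1 - (A^-1 x)(A^-1 x)^T / (1 + x^T A^-1 x)
        ax = _matvec(self.a_inv, x)
        denom = 1.0 + _dot(x, ax)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(len(x)):
            self.b[i] += reward * x[i]

    def mean(self, x):
        # Predicted reward under the ridge estimate theta = A^-1 b.
        theta = _matvec(self.a_inv, self.b)
        return _dot(theta, x)

    def variance(self, x):
        # Uncertainty x^T A^-1 x; it shrinks along observed directions.
        return _dot(x, _matvec(self.a_inv, x))


if __name__ == "__main__":
    head = LinearHead(dim=2)
    head.update([1.0, 0.0], reward=1.0)
    print(head.mean([1.0, 0.0]), head.variance([1.0, 0.0]))
    print(head.variance([0.0, 1.0]))
```

In this sketch the statistics are tied to one fixed feature map. If the network changes the features, the stored `a_inv` and `b` no longer match them, which is the representation-change problem that the abstract addresses by recomputing the estimates from stored data.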

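This paper studies neural-linear bandits, which combine the representation power of deep neural networks with uncertainty estimation mechanisms designed for linear contextual bandits. Using a likelihood matching algorithm that is resilient to catastrophic forgetting, the approach achieves performance comparable to the unlimited-memory method.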