Context detection involves labeling segments of an online stream of data as
belonging to different tasks. Task labels are used in lifelong learning
algorithms to perform consolidation or other procedures that prevent
catastrophic forgetting. Inferring task labels from online experiences remains
a challenging problem. Most approaches assume finite and low-dimensional
observation spaces or a preliminary training phase during which task labels are
learned. Moreover, changes in the transition or reward functions can be
detected only in combination with a policy, and therefore are more difficult to
detect than changes in the input distribution. This paper presents an approach
to learning both policies and labels in an online deep reinforcement learning
setting. The key idea is to use distance metrics, obtained via optimal
transport methods, i.e., Wasserstein distance, on suitable latent action-reward
spaces to measure distances between sets of data points from past and current
streams. Such distances can then be used for statistical tests based on an
adapted Kolmogorov-Smirnov calculation to assign labels to sequences of
experiences. A rollback procedure is introduced to learn multiple policies by
ensuring that only the appropriate data is used to train the corresponding
policy. The combination of task detection and policy deployment allows for the
optimization of lifelong reinforcement learning agents without an oracle that
provides task labels. The approach is tested using two benchmarks and the
results show promising performance when compared with related context detection
algorithms. The results suggest that optimal transport statistical methods
provide an explainable and justifiable procedure for online context detection
and reward optimization in lifelong reinforcement learning.
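
The detection idea above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: scalar features stand in for the latent action-reward space, the equal-size 1-D Wasserstein formula replaces a general optimal transport solver, and the standard asymptotic two-sample Kolmogorov-Smirnov threshold replaces the paper's adapted calculation. The names `wasserstein_1d`, `ks_statistic`, and `task_changed` are hypothetical.

```python
import math
import random
from bisect import bisect_right


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    assert len(a) == len(b), "this shortcut needs equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def task_changed(past_batches, current_batch, alpha=0.05):
    """Flag a task change when distances to the current batch look unlike
    the distances observed among past batches (assumed decision rule)."""
    # Reference distribution: pairwise distances among past batches.
    ref = [wasserstein_1d(p, q)
           for i, p in enumerate(past_batches)
           for q in past_batches[i + 1:]]
    # Test distribution: distance from each past batch to the current batch.
    cur = [wasserstein_1d(p, current_batch) for p in past_batches]
    n, m = len(ref), len(cur)
    # Asymptotic two-sample KS critical value; the distances are not
    # independent, so this threshold is a heuristic here.
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(ref, cur) > critical


if __name__ == "__main__":
    rng = random.Random(0)
    past = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(8)]
    shifted = [rng.gauss(3.0, 1.0) for _ in range(200)]
    print("change detected:", task_changed(past, shifted))
```

In an agent, a positive detection would be the point at which a new label is assigned and, per the abstract, a rollback keeps the mislabeled experiences out of the previous policy's training data.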

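In an online deep reinforcement learning setting, the approach uses distance
metrics from optimal transport methods to measure distances between sets of
data points from past and current data streams, and applies statistical tests
based on an adapted Kolmogorov-Smirnov calculation to assign labels to
sequences of experiences. Combining task detection with policy deployment
allows lifelong reinforcement learning agents to be optimized without an oracle
that provides task labels. The approach is validated on two benchmarks; the
results indicate that, compared with related context detection algorithms,
optimal transport statistical methods provide an explainable and justifiable
procedure for online context detection and reward optimization.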

Statistical Context Detection for Deep Lifelong Reinforcement Learning

The sparsity of reward feedback remains a challenging problem in online deep
reinforcement learning (DRL). Previous approaches have utilized temporal credit
assignment (CA) to achieve impressive results in multiple hard tasks. However,
many CA methods relied on complex architectures or introduced sensitive
hyperparameters to estimate the impact of state-action pairs. Moreover, CA
methods are feasible only if the agent can first obtain trajectories with sparse
rewards, which is difficult in sparse-reward environments with large
state spaces. To tackle these problems, we propose a simple and efficient
algorithm called Policy Optimization with Smooth Guidance (POSG) that leverages
a small set of sparse-reward demonstrations to make reliable and effective
long-term credit assignments while efficiently facilitating exploration. The
key idea is that the relative impact of state-action pairs can be indirectly
estimated using offline demonstrations rather than directly leveraging the
sparse reward trajectories generated by the agent. Specifically, we first
obtain the trajectory importance by considering both the trajectory-level
distance to demonstrations and the returns of the relevant trajectories. Then,
the guidance reward is calculated for each state-action pair by smoothly
averaging the importance of the trajectories that pass through it, merging the
demonstration's distribution and reward information. We theoretically analyze
the performance improvement bound caused by smooth guidance rewards and derive
a new worst-case lower bound on the performance improvement. Extensive results
demonstrate POSG's significant advantages in control performance and
convergence speed compared to benchmark DRL algorithms. Notably, the specific
metrics and quantifiable results are investigated to demonstrate the
superiority of POSG.
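
The two steps described above (trajectory importance, then per-pair guidance rewards) can be illustrated with a toy sketch. It is not POSG itself: the abstract does not specify the distance or the weighting, so the Jaccard-based `trajectory_distance`, the `exp(-d) * (1 + return)` importance, and the plain averaging in `guidance_rewards` are all assumptions chosen for readability, and returns are assumed non-negative.

```python
import math
from collections import defaultdict


def trajectory_distance(traj_pairs, demo_pairs):
    """Assumed distance: 1 - Jaccard overlap of visited (state, action) sets."""
    a, b = set(traj_pairs), set(demo_pairs)
    return 1.0 - len(a & b) / len(a | b)


def trajectory_importance(traj_pairs, traj_return, demos):
    """Assumed importance: closer to a demonstration and higher return
    both increase the score."""
    d = min(trajectory_distance(traj_pairs, demo) for demo in demos)
    return math.exp(-d) * (1.0 + traj_return)


def guidance_rewards(trajectories, demos):
    """Average the importance of every trajectory passing through each
    (state, action) pair."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pairs, ret in trajectories:
        weight = trajectory_importance(pairs, ret, demos)
        for sa in set(pairs):  # count each pair once per trajectory
            totals[sa] += weight
            counts[sa] += 1
    return {sa: totals[sa] / counts[sa] for sa in totals}


if __name__ == "__main__":
    demos = [[(0, "right"), (1, "right"), (2, "right")]]
    trajectories = [
        ([(0, "right"), (1, "right"), (2, "right")], 1.0),  # matches the demo
        ([(0, "left"), (3, "left")], 0.0),                  # wanders away
    ]
    for pair, reward in sorted(guidance_rewards(trajectories, demos).items()):
        print(pair, round(reward, 3))
```

Pairs visited by trajectories that resemble a demonstration receive a larger guidance reward, which is how demonstration information can reach state-action pairs even when the agent's own trajectories earn no environment reward.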

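Using offline demonstrations, the paper proposes a simple and efficient online
deep reinforcement learning algorithm called Policy Optimization with Smooth
Guidance (POSG), which addresses the sparsity of reward feedback and achieves
reliable and effective long-term credit assignment, together with efficient
exploration, in sparse-reward environments.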

Policy Optimization with Smooth Guidance Rewards Learned from Sparse-Reward Demonstrations

We propose the first black-box targeted attack against online deep
reinforcement learning through reward poisoning during training time. Our
attack is applicable to general environments with unknown dynamics learned by
unknown algorithms and requires limited attack budgets and computational
resources. We leverage a general framework and find conditions that ensure an
efficient attack under a general assumption on the learning algorithms. We show
that our attack is optimal in our framework under the conditions. We
experimentally verify that with limited budgets, our attack efficiently leads
the learning agent to various target policies under a diverse set of popular
DRL environments and state-of-the-art learners.

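This paper proposes a black-box targeted attack on online deep reinforcement
learning that poisons rewards at training time. The attack works despite
unknown environments and unknown learning algorithms, and its cost is low. The
authors verify experimentally that, across different environments and learners,
the attack efficiently leads the learning agent to various target policies.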