Context detection involves labeling segments of an online stream of data as
belonging to different tasks. Task labels are used in lifelong learning
algorithms to perform consolidation or other procedures that prevent
catastrophic forgetting. Inferring task labels from online experiences remains
a challenging problem. Most approaches assume finite and low-dimensional
observation spaces or a preliminary training phase during which task labels are
learned. Moreover, changes in the transition or reward functions can be
detected only in combination with a policy, and are therefore more difficult to
detect than changes in the input distribution. This paper presents an approach
to learning both policies and labels in an online deep reinforcement learning
setting. The key idea is to use distance metrics, obtained via optimal
transport methods, i.e., the Wasserstein distance, on suitable latent action-reward
spaces to measure distances between sets of data points from past and current
streams. Such distances can then be used for statistical tests based on an
adapted Kolmogorov-Smirnov calculation to assign labels to sequences of
experiences. A rollback procedure is introduced to learn multiple policies by
ensuring that only the appropriate data is used to train the corresponding
policy. The combination of task detection and policy deployment allows for the
optimization of lifelong reinforcement learning agents without an oracle that
provides task labels. The approach is tested using two benchmarks and the
results show promising performance when compared with related context detection
algorithms. The results suggest that optimal transport statistical methods
provide an explainable and justifiable procedure for online context detection
and reward optimization in lifelong reinforcement learning.
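
As a rough illustration of the key idea, the sketch below compares Wasserstein
distances computed within a past stream against distances computed between the
past and current streams, and flags a context change when a two-sample
Kolmogorov-Smirnov statistic exceeds a threshold. It is a minimal sketch under
simplifying assumptions, not the procedure used in the paper: data points are
scalars rather than latent action-reward vectors, the test is the plain
(unadapted) Kolmogorov-Smirnov statistic, and the function names, batch sizes,
and the threshold of 0.5 are all hypothetical.

```python
import random
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In 1-D, the optimal transport plan matches sorted values pairwise.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (largest gap between ECDFs)."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


def batch_distances(left, right, batch_size, n_batches, rng):
    """Wasserstein distances between random batches drawn from two streams."""
    return [
        wasserstein_1d(rng.sample(left, batch_size), rng.sample(right, batch_size))
        for _ in range(n_batches)
    ]


def context_changed(past, current, batch_size=32, n_batches=50,
                    threshold=0.5, seed=0):
    """Return True if `current` appears to come from a different context.

    `threshold` is a hypothetical cut-off on the KS statistic, chosen only
    for illustration.
    """
    rng = random.Random(seed)
    # Reference distribution: distances observed when no change has occurred.
    within = batch_distances(past, past, batch_size, n_batches, rng)
    # Candidate distribution: distances between past and current data.
    across = batch_distances(past, current, batch_size, n_batches, rng)
    return ks_statistic(within, across) > threshold


# Usage: a synthetic stream whose mean shifts by three standard deviations.
data_rng = random.Random(1)
past_stream = [data_rng.gauss(0.0, 1.0) for _ in range(500)]
shifted_stream = [data_rng.gauss(3.0, 1.0) for _ in range(500)]
print(context_changed(past_stream, shifted_stream))
```

Testing the distribution of distances, rather than a single distance against a
fixed cut-off, lets the sampling noise of the past stream itself define what
counts as "no change". In a higher-dimensional latent space the sorted-matching
shortcut no longer applies, and a general optimal transport solver would be
needed in place of `wasserstein_1d`.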

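In an online deep reinforcement learning setting, distance metrics from optimal
transport methods are used to measure distances between sets of data points
from past and current data streams, and statistical tests based on an adapted
Kolmogorov-Smirnov calculation are used to assign labels to sequences of
experiences. The combination of task detection and policy deployment allows
lifelong reinforcement learning agents to be optimized without an oracle that
provides task labels. The approach is validated on two benchmarks, and the
results indicate that, compared with related context detection algorithms,
optimal transport statistical methods provide an explainable and justifiable
procedure for online context detection and reward optimization.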