Ensuring safe exploration by reinforcement learning (RL) agents during
training is a critical challenge in deploying RL agents in many
real-world scenarios. Training RL agents in unknown, black-box environments
poses an even greater safety risk when prior knowledge of the domain/task is
unavailable. We introduce ADVICE (Adaptive Shielding with a Contrastive
Autoencoder), a novel post-shielding technique that distinguishes safe and
unsafe features of state-action pairs during training, thus protecting the RL
agent from executing actions that yield potentially hazardous outcomes. Our
comprehensive experimental evaluation against state-of-the-art safe RL
exploration techniques demonstrates how ADVICE can significantly reduce safety
violations during training while maintaining a competitive outcome reward.
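The post-shielding idea described above, where the agent's chosen action is checked before it is executed, can be illustrated with a minimal sketch. This is not the ADVICE implementation: ADVICE learns the safe/unsafe distinction with a contrastive autoencoder, whereas here that learned model is replaced by a hand-written scoring function. The names `shielded_action`, `unsafe_score`, `toy_unsafe_score` and the threshold value are all hypothetical.

```python
from typing import Callable, Sequence

State = Sequence[float]
Action = int


def shielded_action(
    state: State,
    proposed: Action,
    actions: Sequence[Action],
    unsafe_score: Callable[[State, Action], float],
    threshold: float = 0.5,
) -> Action:
    """Return `proposed` if it looks safe, otherwise the lowest-risk alternative."""
    if unsafe_score(state, proposed) < threshold:
        return proposed
    # The proposed action was flagged: fall back to the least risky action.
    return min(actions, key=lambda a: unsafe_score(state, a))


def toy_unsafe_score(state: State, action: Action) -> float:
    """Toy stand-in for a learned safety model (an assumption for this sketch).

    Moving right (action 1) gets riskier as the position state[0] approaches
    a hazard at x = 1.0; staying put (action 0) is always scored as safe.
    """
    position = state[0]
    return position if action == 1 else 0.0


# Far from the hazard the agent's choice passes through unchanged.
far_from_hazard = shielded_action([0.2], 1, [0, 1], toy_unsafe_score)
# Close to the hazard the shield overrides the risky choice with action 0.
near_hazard = shielded_action([0.9], 1, [0, 1], toy_unsafe_score)
```

The shield sits between the policy and the environment, so the learning algorithm itself needs no modification; only the executed action changes when a proposal is flagged.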

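During training, an adaptive shielding technique called ADVICE identifies the safe and unsafe features of state-action pairs, protecting the RL agent from executing actions that may lead to hazardous outcomes and effectively reducing the risk of safety violations.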

Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

We introduce the problem of active causal structure learning with advice. In
the typical well-studied setting, the learning algorithm is given the essential
graph for the observational distribution and is asked to recover the underlying
causal directed acyclic graph (DAG) $G^*$ while minimizing the number of
interventions made. In our setting, we are additionally given side information
about $G^*$ as advice, e.g. a DAG $G$ purported to be $G^*$. We ask whether the
learning algorithm can benefit from the advice when it is close to being
correct, while still having worst-case guarantees even when the advice is
arbitrarily bad. Our work is in the same space as the growing body of research
on algorithms with predictions. When the advice is a DAG $G$, we design an
adaptive search algorithm to recover $G^*$ whose intervention cost is at most
$O(\max\{1, \log \psi\})$ times the cost for verifying $G^*$; here, $\psi$ is a
distance measure between $G$ and $G^*$ that is upper bounded by the number of
variables $n$, and is exactly 0 when $G=G^*$. Our approximation factor matches
the state-of-the-art for the advice-less setting.
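The two extremes of the stated bound can be written out explicitly. In the sketch below, $\nu(G^*)$ denotes the verification cost and $c$ the constant hidden in the $O(\cdot)$; both symbols are our notation rather than the abstract's, and $\log 0$ is taken to be $-\infty$ so that the maximum evaluates to 1 when $\psi = 0$.

```latex
\begin{align*}
\text{cost} &\le c \cdot \max\{1, \log \psi\} \cdot \nu(G^*)
  && \text{(general bound)} \\
\psi = 0 \;(G = G^*) &\implies \text{cost} \le c \cdot \nu(G^*)
  && \text{(perfect advice)} \\
\psi \le n &\implies \text{cost} \le c \cdot \max\{1, \log n\} \cdot \nu(G^*)
  && \text{(arbitrarily bad advice)}
\end{align*}
```

With perfect advice the algorithm pays only a constant factor over verification, while in the worst case the factor grows to $O(\log n)$, the approximation factor the abstract says is state-of-the-art for the advice-less setting.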

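This work introduces the problem of active causal structure learning with advice and proposes a new search algorithm for recovering the DAG, whose intervention cost is at most $O(\max\{1, \log \psi\})$ times the cost of verifying the graph, where $\psi$ is a distance measure between $G$ and $G^*$ that is exactly 0 when $G=G^*$.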

Active causal structure learning with advice

$k$-means clustering is a well-studied problem due to its wide applicability.
Unfortunately, there exist strong theoretical limits on the performance of any
algorithm for the $k$-means problem on worst-case inputs. To overcome this
barrier, we consider a scenario where "advice" is provided to help perform
clustering. Specifically, we consider the $k$-means problem augmented with a
predictor that, given any point, returns its cluster label in an approximately
optimal clustering up to some, possibly adversarial, error. We present an
algorithm whose performance improves along with the accuracy of the predictor,
even though naïvely following the accurate predictor can still lead to a
high clustering cost. Thus, if the predictor is sufficiently accurate, we can
retrieve a close-to-optimal clustering with nearly optimal runtime, breaking
known computational barriers for algorithms that do not have access to such
advice. We evaluate our algorithms on real datasets and show significant
improvements in the quality of clustering.
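A small sketch can show why naively following even a mostly accurate predictor can be costly, and how a robust per-cluster estimate helps. This is not the paper's algorithm: the coordinate-wise median is used here as a simple stand-in robust estimator, and the function names and the toy data are assumptions made for illustration only.

```python
from statistics import mean, median
from typing import Dict, List, Sequence

Point = Sequence[float]


def centers_from_labels(
    points: Sequence[Point], labels: Sequence[int], robust: bool = True
) -> Dict[int, List[float]]:
    """Group points by predicted label and compute one center per cluster.

    robust=False uses the coordinate-wise mean (naively trusting the labels);
    robust=True uses the coordinate-wise median, which tolerates a minority
    of mislabeled points in each cluster.
    """
    groups: Dict[int, List[Point]] = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    aggregate = median if robust else mean
    return {
        label: [aggregate(coord) for coord in zip(*members)]
        for label, members in groups.items()
    }


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        for p in points
    )


# Made-up 1-D data: two tight clusters near 0 and 10. The hypothetical
# predictor assigns the fourth point (10.0) to cluster 0 by mistake.
points = [[0.0], [0.1], [-0.1], [10.0], [10.0], [10.1], [9.9]]
predicted = [0, 0, 0, 0, 1, 1, 1]

naive = centers_from_labels(points, predicted, robust=False)
robust = centers_from_labels(points, predicted, robust=True)
naive_cost = kmeans_cost(points, list(naive.values()))
robust_cost = kmeans_cost(points, list(robust.values()))
```

On this toy input a single mislabeled point drags the naive mean of cluster 0 far from its true members, so `naive_cost` is much larger than `robust_cost`.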

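By introducing a predictor, this paper proposes a new $k$-means clustering algorithm that improves both the quality and the efficiency of clustering, breaking previously known computational barriers for the $k$-means problem.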

Learning-Augmented $k$-means Clustering

In this article we study the transfer learning model of action advice under a
budget. We focus on reinforcement learning teachers providing action advice to
heterogeneous students playing the game of Pac-Man under a limited advice
budget. First, we examine several critical factors affecting advice quality in
this setting, such as the average performance of the teacher, its variance and
the importance of reward discounting in advising. The experiments show the
non-trivial importance of the coefficient of variation (CV) as a statistic for
choosing policies that generate advice. The CV statistic relates variance to
the corresponding mean. Second, the article studies policy learning for
distributing advice under a budget. Whereas most methods in the relevant
literature rely on heuristics for advice distribution, we formulate the problem
as a learning one and propose a novel RL algorithm capable of learning when to
advise, adapting to the student and the task at hand. Furthermore, we argue
that learning to advise under a budget is an instance of a more generic
learning problem: Constrained Exploitation Reinforcement Learning.
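The coefficient of variation mentioned above is the standard deviation divided by the mean. The sketch below computes it for episode returns; the teacher names and scores are made up for illustration, and the use of the population standard deviation (rather than the sample one) is an assumption, since the abstract does not say which is used.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def coefficient_of_variation(returns: Sequence[float]) -> float:
    """CV = standard deviation / mean of the episode returns."""
    mu = mean(returns)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(returns) / mu


# Made-up episode scores for two hypothetical teacher policies.
teacher_returns: Dict[str, Sequence[float]] = {
    "teacher_a": [100.0, 110.0, 90.0, 100.0],
    "teacher_b": [50.0, 150.0, 20.0, 180.0],
}

cv_by_teacher = {
    name: coefficient_of_variation(scores)
    for name, scores in teacher_returns.items()
}
```

Both hypothetical teachers have the same mean return, so the mean alone cannot separate them, whereas their CVs differ by roughly an order of magnitude.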

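This article studies the transfer learning model of action advice under a budget. We focus on reinforcement learning teachers providing action advice to heterogeneous students playing the game of Pac-Man under a limited advice budget. First, we examine several key factors affecting advice quality in this setting, such as the teacher's average performance, its variance, and the importance of reward discounting in advising. The experiments show the non-trivial importance of the coefficient of variation (CV) as a statistic for choosing policies that generate advice. Second, the article studies policy learning for distributing advice under a budget. Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning problem and propose a new reinforcement learning algorithm that learns when to advise, adapting to the student and the task at hand. Furthermore, we argue that learning to advise under a budget is an instance of a more general learning problem: Constrained Exploitation Reinforcement Learning.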