We present $\varepsilon$-retrain, an exploration strategy designed to
encourage a behavioral preference while optimizing policies with monotonic
improvement guarantees. To this end, we introduce an iterative procedure for
collecting retrain areas -- parts of the state space where an agent did not
follow the behavioral preference. Our method then switches between the typical
uniform restart state distribution and the retrain areas using a decaying
factor $\varepsilon$, allowing agents to retrain on situations where they
violated the preference. Experiments over hundreds of seeds across locomotion,
navigation, and power network tasks show that our method yields agents with
significantly better performance and sample efficiency. Moreover,
we employ formal verification of neural networks to provably quantify the
degree to which agents adhere to behavioral preferences.
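
The restart mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: retrain areas are stored as axis-aligned boxes around violating states, a retrain area is chosen with probability $\varepsilon$ (otherwise the uniform restart distribution is used), and $\varepsilon$ decays multiplicatively after every restart. The names `EpsilonRetrainSampler`, `add_violation`, `sample_restart`, `radius`, and `decay` are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# Per-dimension (low, high) bounds describing a box in the state space.
Box = List[Tuple[float, float]]


@dataclass
class EpsilonRetrainSampler:
    """Hypothetical restart-state sampler mixing uniform and retrain-area restarts."""
    state_bounds: Box                  # support of the uniform restart distribution
    epsilon: float = 0.5               # probability of restarting inside a retrain area
    decay: float = 0.99                # assumed multiplicative decay applied per restart
    min_epsilon: float = 0.0           # lower bound for epsilon
    retrain_areas: List[Box] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def add_violation(self, state: List[float], radius: float = 0.1) -> None:
        """Record a retrain area: a box of half-width `radius` around a violating state."""
        box = [
            (max(lo, s - radius), min(hi, s + radius))
            for s, (lo, hi) in zip(state, self.state_bounds)
        ]
        self.retrain_areas.append(box)

    def sample_restart(self) -> List[float]:
        """Draw a restart state, then decay epsilon."""
        use_retrain = bool(self.retrain_areas) and self.rng.random() < self.epsilon
        box = self.rng.choice(self.retrain_areas) if use_retrain else self.state_bounds
        state = [self.rng.uniform(lo, hi) for lo, hi in box]
        self.epsilon = max(self.min_epsilon, self.epsilon * self.decay)
        return state
```

In a training loop under these assumptions, one would call `add_violation` whenever a rollout breaks the behavioral preference and call `sample_restart` at the start of each episode. As $\varepsilon$ shrinks, restarts come increasingly from the uniform distribution.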
