Reward hacking occurs when an agent performs very well with respect to a
"proxy" reward function (which may be hand-specified or learned), but poorly
with respect to the unknown true reward. Since ensuring good alignment between
the proxy and true reward is extremely difficult, one approach to preventing
reward hacking is to optimize the proxy conservatively. Prior work has
focused in particular on constraining the learned policy to behave similarly to a
"safe" policy by penalizing the KL divergence between their action
distributions (AD). However, AD regularization does not always work well, since a
small change in action distribution at a single state can lead to potentially
calamitous outcomes, while large changes might not be indicative of any
dangerous activity. Our insight is that when an agent is reward hacking, it visits
drastically different states from those reached by the safe policy, causing
large deviations in state occupancy measure (OM). Thus, we propose regularizing
based on the OM divergence between policies instead of AD divergence to prevent
reward hacking. We theoretically establish that OM regularization can more
effectively avoid large drops in true reward. Then, we empirically demonstrate
in a variety of realistic environments that OM divergence is superior to AD
divergence for preventing reward hacking by regularizing towards a safe policy.
Furthermore, we show that occupancy measure divergence can also regularize
learned policies away from reward hacking behavior. Our code and data are
available at this https URL
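
To make the contrast between AD and OM divergence concrete, the following is a minimal sketch on a hypothetical 3-state, 2-action tabular MDP. It is not the authors' implementation. The MDP, the policies `pi_safe`, `pi_a`, `pi_b`, the helper names, the unweighted sum of per-state KL terms, and the use of total variation as the OM divergence are all assumptions made for illustration.

```python
import numpy as np

def state_occupancy(P, pi, mu0, gamma):
    """Discounted state occupancy d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s).

    P[s, a, t] is the probability of moving from s to t under action a,
    pi[s, a] is the policy, mu0[s] is the initial state distribution.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    n = P.shape[0]
    return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)

def ad_kl(pi, pi_safe):
    """Sum over states of KL(pi(.|s) || pi_safe(.|s)); unweighted for simplicity."""
    return float(np.sum(pi * np.log(pi / pi_safe)))

def om_tv(d, d_safe):
    """Total variation distance between two state occupancy measures."""
    return 0.5 * float(np.abs(d - d_safe).sum())

# Hypothetical MDP: state 2 is an absorbing "bad" state.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = 1.0   # state 0, action 0 -> state 1
P[0, 1, 2] = 1.0   # state 0, action 1 -> absorbing state 2
P[1, :, 0] = 1.0   # state 1, either action -> state 0
P[2, :, 2] = 1.0   # state 2 never leaves
mu0 = np.array([1.0, 0.0, 0.0])
gamma = 0.99

pi_safe = np.array([[0.99, 0.01], [0.5, 0.5], [0.5, 0.5]])
pi_a = pi_safe.copy()
pi_a[0] = [0.9, 0.1]     # modest change at the one state where it matters
pi_b = pi_safe.copy()
pi_b[1] = [0.99, 0.01]   # large change at a state where actions are equivalent

d_safe = state_occupancy(P, pi_safe, mu0, gamma)
d_a = state_occupancy(P, pi_a, mu0, gamma)
d_b = state_occupancy(P, pi_b, mu0, gamma)

for name, pi, d in [("A", pi_a, d_a), ("B", pi_b, d_b)]:
    print(name, "AD KL:", ad_kl(pi, pi_safe), "OM TV:", om_tv(d, d_safe))
```

In this toy setting, policy B has the larger AD divergence even though it visits the same states as the safe policy, while policy A has the smaller AD divergence but spends most of its discounted time in the absorbing state. An AD penalty would therefore rank the two policies in the opposite order from an OM penalty.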

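Reward hacking occurs when an agent performs well according to a "proxy" reward function (which may be hand-specified or learned) but poorly with respect to the unknown true reward. We propose regularizing based on the state occupancy measure, rather than the action distribution, to prevent reward hacking, and we validate this approach both theoretically and empirically.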

Preventing Reward Hacking with Occupancy Measure Regularization

Many real-world problems require trading off multiple competing objectives.
However, these objectives are often in different units and/or scales, which can
make it challenging for practitioners to express numerical preferences over
objectives in their native units. In this paper we propose a novel algorithm
for multi-objective reinforcement learning that enables setting desired
preferences for objectives in a scale-invariant way. We propose to learn an
action distribution for each objective, and we use supervised learning to fit a
parametric policy to a combination of these distributions. We demonstrate the
effectiveness of our approach on challenging high-dimensional real and
simulated robotics tasks, and show that setting different preferences in our
framework allows us to trace out the space of nondominated solutions.
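
One way to read "a scale-invariant way" is sketched below for a single state with discrete actions. It is an illustrative interpretation, not the paper's algorithm as stated in the abstract. The sketch assumes each preference is a KL budget `epsilon` that limits how far the per-objective action distribution may move from the current policy, and that the parametric policy is a softmax fitted by cross-entropy to the average of those distributions. All names and Q-values (`improved_distribution`, `fit_policy`, `q_task`, `q_cost`) are hypothetical.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def improved_distribution(q_values, pi_old, epsilon, iters=200):
    """Return q(a) proportional to pi_old(a) * exp(Q(a) / eta),
    with the temperature eta chosen so that KL(q || pi_old) = epsilon."""
    log_pi = np.log(pi_old)

    def candidate(eta):
        log_q = log_softmax(log_pi + q_values / eta)
        q = np.exp(log_q)
        return q, float(np.sum(q * (log_q - log_pi)))

    # KL decreases as eta grows, so bisect on eta in log space.
    lo, hi = 1e-8, 1e8
    for _ in range(iters):
        eta = np.sqrt(lo * hi)
        _, kl = candidate(eta)
        if kl > epsilon:
            lo = eta   # distribution too greedy: raise the temperature
        else:
            hi = eta
    return candidate(np.sqrt(lo * hi))[0]

def fit_policy(targets, steps=2000, lr=1.0):
    """Fit softmax logits by gradient ascent on the average of E_{q_k}[log pi(a)]."""
    q_bar = np.mean(targets, axis=0)
    theta = np.zeros_like(q_bar)
    for _ in range(steps):
        pi = np.exp(log_softmax(theta))
        theta += lr * (q_bar - pi)   # gradient of the cross-entropy objective
    return np.exp(log_softmax(theta))

pi_old = np.full(4, 0.25)
q_task = np.array([1.0, 0.5, 0.0, -1.0])            # hypothetical objective 1
q_cost = np.array([-1.0, 0.0, 0.2, 1.0]) * 1000.0   # objective 2, much larger scale

q1 = improved_distribution(q_task, pi_old, epsilon=0.3)
q2 = improved_distribution(q_cost, pi_old, epsilon=0.05)
pi_new = fit_policy([q1, q2])

# Rescaling an objective leaves its improved distribution unchanged.
q1_scaled = improved_distribution(1000.0 * q_task, pi_old, epsilon=0.3)
print(np.allclose(q1, q1_scaled, atol=1e-6))
```

Under these assumptions, multiplying an objective's Q-values by a positive constant only rescales the temperature found by the bisection, so the preference `epsilon` has the same meaning whatever units the objective is measured in.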

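This paper proposes a new algorithm for multi-objective reinforcement learning that allows preferences over objectives to be set in a scale-invariant way. By learning an action distribution for each objective and fitting a parametric policy to their combination, it demonstrates its effectiveness on high-dimensional real and simulated robotics tasks and traces out the space of nondominated solutions.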