Reward hacking occurs when an agent performs very well with respect to a
"proxy" reward function (which may be hand-specified or learned), but poorly
with respect to the unknown true reward. Since ensuring good alignment between
the proxy and true reward is extremely difficult, one approach to preventing
reward hacking is to optimize the proxy conservatively. Prior work has
particularly focused on constraining the learned policy to behave similarly to a
"safe" policy by penalizing the KL divergence between their action
distributions (AD). However, AD regularization does not always work well, since a
small change in action distribution at a single state can lead to potentially
calamitous outcomes, while large changes might not be indicative of any
dangerous activity. Our insight is that when an agent reward hacks, it visits
drastically different states from those reached by the safe policy, causing
large deviations in state occupancy measure (OM). Thus, we propose regularizing
based on the OM divergence between policies instead of AD divergence to prevent
reward hacking. We theoretically establish that OM regularization can more
effectively avoid large drops in true reward. Then, we empirically demonstrate
in a variety of realistic environments that OM divergence is superior to AD
divergence for preventing reward hacking by regularizing towards a safe policy.
Furthermore, we show that occupancy measure divergence can also regularize
learned policies away from reward hacking behavior. Our code and data are
publicly available.
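
The contrast between AD and OM divergence can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the two-state MDP, the policy names, the occupancy-weighted KL as the AD measure, and total variation as the OM measure are all assumptions chosen for illustration.

```python
import math

GAMMA = 0.99
HORIZON = 5000  # truncation length; GAMMA**HORIZON is negligible

# Hypothetical toy MDP: state 0 is a hub, state 1 is an absorbing "hacked" state.
# P[s][a] is the next-state distribution after taking action a in state s.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # hub: action 0 stays, action 1 falls into state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: both actions stay in state 1
]
START = [1.0, 0.0]


def state_occupancy(policy):
    """Discounted state occupancy: (1 - gamma) * sum_t gamma^t * Pr(s_t = s)."""
    n = len(P)
    dist = START[:]
    mu = [0.0] * n
    weight = 1.0 - GAMMA
    for _ in range(HORIZON):
        for s in range(n):
            mu[s] += weight * dist[s]
        nxt = [0.0] * n
        for s in range(n):
            for a, prob_a in enumerate(policy[s]):
                for t in range(n):
                    nxt[t] += dist[s] * prob_a * P[s][a][t]
        dist = nxt
        weight *= GAMMA
    return mu


def ad_divergence(policy, safe):
    """KL(policy || safe) per state, averaged over the states `policy` visits."""
    total = 0.0
    for s, weight in enumerate(state_occupancy(policy)):
        kl = sum(p * math.log(p / q) for p, q in zip(policy[s], safe[s]))
        total += weight * kl
    return total


def om_divergence(policy, safe):
    """Total variation distance between the two state occupancy measures."""
    mu_p, mu_q = state_occupancy(policy), state_occupancy(safe)
    return 0.5 * sum(abs(x - y) for x, y in zip(mu_p, mu_q))


safe = [[0.99, 0.01], [0.5, 0.5]]
risky = [[0.90, 0.10], [0.5, 0.5]]        # small change at the hub
harmless = [[0.99, 0.01], [0.99, 0.01]]   # large change where actions are irrelevant

for name, pi in [("risky", risky), ("harmless", harmless)]:
    print(f"{name}: AD={ad_divergence(pi, safe):.3f}  OM={om_divergence(pi, safe):.3f}")
```

In this sketch the risky policy has the smaller AD divergence but the larger OM divergence, because its slightly higher chance of leaving the hub compounds over many time steps. The harmless policy changes its actions substantially, but only in a state where actions have no effect, so its state occupancy is unchanged.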

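Reward hacking occurs when an agent performs well with respect to a "proxy" reward function (which may be hand-specified or learned) but poorly with respect to the unknown true reward. The authors propose regularizing based on the state occupancy measure, rather than the action distribution, to prevent reward hacking, and validate the approach both theoretically and empirically.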

Preventing Reward Hacking with Occupancy Measure Regularization

Reward design in reinforcement learning (RL) is challenging since specifying
human notions of desired behavior may be difficult via reward functions or
require many expert demonstrations. Can we instead cheaply design rewards using
a natural language interface? This paper explores how to simplify reward design
by prompting a large language model (LLM) such as GPT-3 as a proxy reward
function, where the user provides a textual prompt containing a few examples
(few-shot) or a description (zero-shot) of the desired behavior. Our approach
leverages this proxy reward function in an RL framework. Specifically, users
specify a prompt once at the beginning of training. During training, the LLM
evaluates an RL agent's behavior against the desired behavior described by the
prompt and outputs a corresponding reward signal. The RL agent then uses this
reward to update its behavior. We evaluate whether our approach can train
agents aligned with user objectives in the Ultimatum Game, matrix games, and
the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents
trained with our framework are well-aligned with the user's objectives and
outperform RL agents trained with reward functions learned via supervised
learning.
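
The training loop described above (prompt specified once, LLM scores each episode, agent updates on that score) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: `stub_llm` is a hypothetical stand-in that hard-codes the user's rule instead of calling a real model, the prompt wording and the yes/no-to-binary reward parsing are assumed, and a simple epsilon-greedy tabular learner stands in for the RL agent in a single-step Ultimatum Game.

```python
import random

# Few-shot prompt, written once before training (wording is an assumption).
PROMPT = (
    "The user wants the responder to reject any offer below 30% of the pot.\n"
    "offer 1 of 10, responder rejects. Desirable? Yes\n"
    "offer 1 of 10, responder accepts. Desirable? No\n"
)


def stub_llm(text):
    """Hypothetical stand-in for an LLM call; it hard-codes the user's rule."""
    words = text.strip().splitlines()[-1].split()
    offer = int(words[1])
    accepted = "accepts." in words
    return "Yes" if accepted == (offer >= 3) else "No"


def proxy_reward(llm, offer, accepted):
    """Describe the episode as text, query the LLM, and parse a binary reward."""
    verb = "accepts" if accepted else "rejects"
    episode = f"offer {offer} of 10, responder {verb}. Desirable?"
    answer = llm(PROMPT + episode)
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0


def train(llm, episodes=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner over offers 0..10; action 1 = accept, 0 = reject."""
    rng = random.Random(seed)
    value = [[0.0, 0.0] for _ in range(11)]  # running mean of the proxy reward
    count = [[0, 0] for _ in range(11)]
    for _ in range(episodes):
        offer = rng.randint(0, 10)
        if rng.random() < epsilon:
            action = rng.randint(0, 1)
        else:
            action = int(value[offer][1] > value[offer][0])
        reward = proxy_reward(llm, offer, bool(action))
        count[offer][action] += 1
        value[offer][action] += (reward - value[offer][action]) / count[offer][action]
    return [int(v[1] > v[0]) for v in value]


policy = train(stub_llm)
print(policy)  # one accept/reject decision per offer size
```

Swapping `stub_llm` for a function that sends the prompt to an actual language model would leave the rest of the loop unchanged, which is the design point: the reward function is just a text-in, text-out callable.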

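This paper explores simplifying reward design by using a large language model, prompted through a natural language interface, as a proxy reward function. Within an RL framework, the LLM's reward signal is used to train agents that are aligned with user objectives. In the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task, these agents outperform RL agents trained with reward functions learned via supervised learning.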