Large language models (LLMs) aligned to human preferences via reinforcement
learning from human feedback (RLHF) underpin many commercial applications.
However, how RLHF impacts LLM internals remains opaque. We propose a novel
method to interpret learned reward functions in RLHF-tuned LLMs using sparse
autoencoders. Our approach trains sets of autoencoders on activations from a base
LLM and its RLHF-tuned version. By comparing autoencoder hidden spaces, we
identify unique features that reflect the accuracy of the learned reward model.
To quantify this, we construct a scenario where the tuned LLM learns
token-reward mappings to maximize reward. This is the first application of
sparse autoencoders for interpreting learned rewards and broadly inspecting
reward learning in LLMs. Our method provides an abstract approximation of
reward integrity. This presents a promising technique for ensuring alignment
between specified objectives and model behaviors.
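
As a rough illustration of the building block involved, the sketch below shows the forward pass and training objective of a single sparse autoencoder applied to one activation vector. It is a minimal, dependency-free sketch and not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the layer sizes, and the names `sae_forward`, `sae_loss`, and `l1_coeff` are all assumptions made for illustration.

```python
import random


def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into a non-negative hidden code, then reconstruct it."""
    h = relu([z + b for z, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = [z + b for z, b in zip(matvec(W_dec, h), b_dec)]
    return h, x_hat


def sae_loss(x, h, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the code."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in h)
    return mse + l1_coeff * l1


random.seed(0)
d_model, d_hidden = 4, 8  # hypothetical sizes, chosen only for the demo
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
b_enc = [0.0] * d_hidden
b_dec = [0.0] * d_model

x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one LLM activation vector
h, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, h, x_hat, l1_coeff=1e-3)
```

In the setting described above, one would presumably train such autoencoders separately on activations from the base model and from the RLHF-tuned model, and then compare the hidden codes `h` they produce.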

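Sparse autoencoders are used to interpret the learned reward functions in RLHF-tuned large language models and to further inspect reward learning in language models, with the aim of ensuring alignment between specified objectives and model behavior.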

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

Using learned reward functions (LRFs) to solve sparse-reward
reinforcement learning (RL) tasks has yielded steady progress in
task complexity over the years. In this work, we question whether today's
LRFs are best-suited as a direct replacement for task rewards. Instead, we
propose leveraging the capabilities of LRFs as a pretraining signal for RL.
Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated
$\textbf{P}$retraining (LAMP) which leverages the zero-shot capabilities of
Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as
opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to
scalably generate noisy, albeit shaped exploration rewards by computing the
contrastive alignment between a highly diverse collection of language
instructions and the image observations of an agent in its pretraining
environment. Using reinforcement learning, LAMP optimizes these rewards
together with standard novelty-seeking exploration rewards to acquire a
language-conditioned, pretrained policy. Our VLM pretraining approach, which is
a departure from previous attempts to use LRFs, can warmstart sample-efficient
learning on robot manipulation tasks in RLBench.
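
The reward construction described above can be sketched in a few lines. This is a hedged illustration, not the LAMP implementation: it assumes the VLM exposes image and instruction embeddings as plain vectors, takes cosine similarity as the alignment score, and combines that score with a novelty bonus through a weighted sum. The weighted sum, the `alpha` parameter, and the function names are assumptions that do not come from the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pretraining_reward(image_emb, instruction_emb, novelty_bonus, alpha=0.5):
    """Blend a VLM alignment score with a novelty bonus.

    alpha is a hypothetical mixing weight; the abstract does not say
    how the two reward terms are combined.
    """
    alignment = cosine_similarity(image_emb, instruction_emb)
    return alpha * alignment + (1.0 - alpha) * novelty_bonus


# Toy embeddings: the first pair is perfectly aligned, the second orthogonal.
aligned = pretraining_reward([1.0, 0.0], [2.0, 0.0], novelty_bonus=0.2)
orthogonal = pretraining_reward([1.0, 0.0], [0.0, 3.0], novelty_bonus=0.2)
```

With these toy inputs the aligned observation receives a higher reward than the orthogonal one, which is the shaping effect the language instructions are meant to provide during pretraining.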

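Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity. This paper proposes using LRFs as a pretraining signal for RL through $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a pretraining utility for RL rather than as a downstream task reward. LAMP uses a frozen, pretrained VLM to generate noisy but shaped exploration rewards by computing the contrastive alignment between a large collection of language instructions and the image observations of the agent in its environment. LAMP optimizes these rewards together with novelty-seeking exploration rewards using reinforcement learning to obtain a language-conditioned, pretrained policy. This VLM pretraining approach differs from previous uses of LRFs and can warm-start sample-efficient learning on robot manipulation tasks in RLBench.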

Language Reward Modulation for Pretraining Reinforcement Learning

The ability to learn reward functions plays an important role in enabling the
deployment of intelligent agents in the real world. However, comparing reward
functions, for example as a means of evaluating reward learning methods,
presents a challenge. Reward functions are typically compared by considering
the behavior of optimized policies, but this approach conflates deficiencies in
the reward function with those of the policy search algorithm used to optimize
it. To address this challenge, Gleave et al. (2020) propose the
Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy
optimization, but in doing so requires computing reward values at transitions
that may be impossible under the system dynamics. This is problematic for
learned reward functions because it entails evaluating them outside of their
training distribution, resulting in inaccurate reward values that we show can
render EPIC ineffective at comparing rewards. To address this problem, we
propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
DARD uses an approximate transition model of the environment to transform
reward functions into a form that allows for comparisons that are invariant to
reward shaping while only evaluating reward functions on transitions close to
their training distribution. Experiments in simulated physical domains
demonstrate that DARD enables reliable reward comparisons without policy
optimization and is significantly more predictive than baseline methods of
downstream policy performance when dealing with learned reward functions.
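
To make the comparison problem concrete, the sketch below implements a simplified, tabular version of the EPIC-style baseline discussed above; it is not DARD and not the authors' code. It assumes rewards of the form `R[s][s_next]` with no action dependence, uniform distributions over states, and a Pearson-based distance `sqrt((1 - rho) / 2)`. The function names and the toy reward are hypothetical. Note that `canonicalize` averages the reward over every `(s, s_next)` pair, including pairs that may be impossible under the dynamics, which is the weakness the abstract attributes to EPIC for learned rewards.

```python
import math


def canonicalize(R, gamma):
    """Remove potential shaping from a tabular reward R[s][s_next].

    Assumes uniform state distributions and no action dependence.
    """
    n = len(R)
    mean_from = [sum(R[s]) / n for s in range(n)]  # E over S' of R(s, S')
    mean_all = sum(mean_from) / n                  # E over S, S' of R(S, S')
    return [[R[s][t] + gamma * mean_from[t] - mean_from[s] - gamma * mean_all
             for t in range(n)] for s in range(n)]


def pearson_distance(x, y):
    """sqrt((1 - rho) / 2), where rho is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    rho = cov / math.sqrt(vx * vy)
    return math.sqrt(max(0.0, (1.0 - rho) / 2.0))


def flatten(R):
    return [v for row in R for v in row]


def epic_like_distance(R1, R2, gamma):
    """Pearson distance between the canonicalized rewards."""
    return pearson_distance(flatten(canonicalize(R1, gamma)),
                            flatten(canonicalize(R2, gamma)))


def shape(R, phi, gamma):
    """Apply potential shaping: R(s, s') + gamma * phi(s') - phi(s)."""
    n = len(R)
    return [[R[s][t] + gamma * phi[t] - phi[s] for t in range(n)]
            for s in range(n)]


gamma = 0.9
R = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # toy reward
shaped = shape(R, phi=[0.0, 5.0, -2.0], gamma=gamma)
negated = [[-v for v in row] for row in R]

raw = pearson_distance(flatten(R), flatten(shaped))  # no canonicalization
d_shaped = epic_like_distance(R, shaped, gamma)      # close to 0
d_negated = epic_like_distance(R, negated, gamma)    # close to 1
```

Under these assumptions the shaped reward is at (numerically) zero distance from the original once both are canonicalized, whereas the raw Pearson distance treats them as clearly different, and the negated reward sits at the maximum distance.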

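The ability to learn reward functions is important for deploying intelligent agents in the real world. Gleave et al. (2020) proposed the Equivalent-Policy Invariant Comparison (EPIC) distance to address the difficulty of comparing reward functions, but EPIC can require evaluating learned rewards outside their training distribution. This work proposes the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric that makes reward comparisons invariant to reward shaping while remaining reliable for learned rewards. Experiments show that DARD-based comparison requires no policy optimization and is more predictive than baseline methods when dealing with learned reward functions.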