Aligning Large Language Models (LLMs) to cater to different human
preferences, learn new skills, and unlearn harmful behavior is an
important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree
Search, are performant, but impractical for LLM adaptation due to their high
inference cost. On the other hand, using Reinforcement Learning (RL) for
adaptation is computationally efficient, but performs worse due to the
optimization challenges in co-training the value function and the policy. We
present a new framework for reward optimization, Value Augmented Sampling
(VAS), that can maximize different reward functions using data sampled from
only the initial, frozen LLM. VAS solves for the optimal reward-maximizing
policy without co-training the policy and the value function, which makes the
optimization stable. It outperforms established baselines, such as PPO and
DPO, on standard benchmarks and achieves results comparable to Best-of-128 at
lower inference cost. Unlike existing RL methods that require changing the
weights of the LLM, VAS does not require access to the weights of the
pre-trained LLM. Thus, it can even adapt LLMs (e.g., ChatGPT) that are
available only as APIs. In addition, our algorithm unlocks the new capability
of composing several rewards and controlling the extent of each one at
deployment time, paving the way for aligned, personalized LLMs.
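
The abstract does not state the decoding rule, so the following is only a
minimal sketch of one plausible reading: the frozen model's next-token
log-probabilities are shifted by weighted value estimates, one weight per
reward, and renormalized. The function name augmented_distribution, the
dictionary layout, and the additive form with weights named betas are all
assumptions made for illustration, not the authors' implementation.

```python
import math

def augmented_distribution(base_logprobs, values, betas):
    """Reweight a frozen model's next-token distribution (hypothetical sketch).

    base_logprobs: token -> log-probability reported by the frozen LLM
    values:        reward name -> (token -> estimated value of picking token)
    betas:         reward name -> weight chosen at deployment time
    """
    scores = {}
    for token, logprob in base_logprobs.items():
        # Assumed rule: add a weighted value bonus for every active reward.
        bonus = sum(betas[name] * values[name].get(token, 0.0) for name in betas)
        scores[token] = logprob + bonus
    # Numerically stable softmax over the candidate tokens.
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {token: math.exp(s - top) / norm for token, s in scores.items()}
```

Under these assumptions, setting every weight to zero recovers the base
distribution, and raising the weight of a single reward shifts probability
toward tokens that its value estimate favors. Only token log-probabilities are
needed, which matches the abstract's point that model weights are not required.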

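Value Augmented Sampling (VAS) is a reward optimization framework that
maximizes different reward functions without co-training the policy and the
value function, which keeps the optimization stable. On standard benchmarks it
outperforms PPO and DPO and achieves results comparable to Best-of-128 at lower
inference cost. It can adapt LLMs that are available only as APIs (e.g.,
ChatGPT) and paves the way for aligned, personalized LLMs.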

Value Augmented Sampling for Language Model Alignment and Personalization

Deep reinforcement learning (RL) methods generally engage in exploratory
behavior through noise injection in the action space. An alternative is to add
noise directly to the agent's parameters, which can lead to more consistent
exploration and a richer set of behaviors. Methods such as evolutionary
strategies use parameter perturbations, but discard all temporal structure in
the process and require significantly more samples. Combining parameter noise
with traditional RL methods offers the best of both worlds. We
demonstrate that both off- and on-policy methods benefit from this approach
through experimental comparison of DQN, DDPG, and TRPO on high-dimensional
discrete action environments as well as continuous control tasks. Our results
show that RL with parameter noise learns more efficiently than traditional RL
with action space noise and evolutionary strategies individually.
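
The contrast between the two exploration styles can be sketched with a toy
linear policy. This is an illustrative example only: the helper names act,
action_noise_rollout, and parameter_noise_rollout are hypothetical, and the
fixed Gaussian scale sigma is an assumption, since the abstract does not say
how the noise is scaled.

```python
import random

def act(weights, state):
    """Deterministic linear policy: action is the dot product of weights and state."""
    return sum(w * s for w, s in zip(weights, state))

def action_noise_rollout(weights, states, sigma, rng):
    """Action space noise: fresh Gaussian noise is added to every action."""
    return [act(weights, s) + rng.gauss(0.0, sigma) for s in states]

def parameter_noise_rollout(weights, states, sigma, rng):
    """Parameter noise: perturb the weights once, then act deterministically."""
    perturbed = [w + rng.gauss(0.0, sigma) for w in weights]
    return [act(perturbed, s) for s in states]

rng = random.Random(0)
states = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]  # first and last state are identical
weights = [0.3, -0.2]
noisy_actions = action_noise_rollout(weights, states, 0.1, rng)
noisy_params = parameter_noise_rollout(weights, states, 0.1, rng)
```

In this sketch the parameter-noise rollout returns the same action for the two
identical states, because the perturbation is fixed for the whole rollout,
while the action-noise rollout generally does not. That state-dependent
consistency is what the abstract refers to as more consistent exploration.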

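Combining parameter noise with traditional deep RL methods yields more
efficient learning than either traditional deep RL or evolutionary strategies
on high-dimensional discrete action environments and continuous control tasks,
and parameter noise outperforms action space noise in both discrete and
continuous domains.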