Reinforcement Learning from Human Feedback (RLHF) involves training policy
models (PMs) and reward models (RMs) to align language models with human
preferences. Instead of focusing solely on PMs and RMs independently, we
propose to examine their interactions during fine-tuning, introducing the
concept of seamlessness. Our study starts with observing the saturation
phenomenon, where continual improvements in RM and PM do not translate into
RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM
responses, resulting in a 35% mismatch rate with human preferences,
highlighting a significant discrepancy between PM and RM. To measure
seamlessness between PM and RM without human effort, we propose an automatic
metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments
induced by data samples. We validate the effectiveness of SEAM in data
selection and model augmentation. Our experiments demonstrate that (1) using
SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2)
SEAM-guided model augmentation results in a 4% performance improvement over
standard augmentation methods.
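
The mismatch rate mentioned above can be illustrated with a minimal sketch. This is not the SEAM metric, whose definition the abstract does not give; it only shows how disagreement between RM scores and human preference labels on response pairs might be counted. The function name `rm_human_mismatch_rate` and the tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple

# Each item: (RM score for response A, RM score for response B,
#             True if the human annotator preferred response A).
# This layout is a hypothetical choice, not taken from the paper.
ScoredPair = Tuple[float, float, bool]


def rm_human_mismatch_rate(pairs: List[ScoredPair]) -> float:
    """Fraction of pairs where the RM's preferred response differs
    from the human-preferred response."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    mismatches = 0
    for score_a, score_b, human_prefers_a in pairs:
        # A tie in RM scores is treated as the RM preferring B
        # (an arbitrary convention for this sketch).
        rm_prefers_a = score_a > score_b
        if rm_prefers_a != human_prefers_a:
            mismatches += 1
    return mismatches / len(pairs)


example_pairs = [
    (0.9, 0.1, True),   # RM and human agree on A
    (0.2, 0.8, True),   # RM prefers B, human prefers A
    (0.3, 0.7, False),  # RM and human agree on B
    (0.6, 0.4, False),  # RM prefers A, human prefers B
]
mismatch = rm_human_mismatch_rate(example_pairs)
```

On the four toy pairs the RM disagrees with the human label twice, so `mismatch` is 0.5; the 35% figure in the abstract would correspond to a value of 0.35 under this kind of count.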

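Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences by training policy models and reward models. We propose studying the interaction between the policy model and the reward model during fine-tuning through the concept of seamlessness, examine its effect on performance, and introduce an automatic metric, SEAM, to measure the seamlessness between the two. Experiments show that using SEAM for data selection and model augmentation significantly improves RLHF performance.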

Exploring the Seamlessness between Reward and Policy Models in Reinforcement Learning

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

Recent advancements in large language models (LLMs) aim to tackle
heterogeneous human expectations and values via multi-objective preference
alignment. However, existing methods are parameter-adherent to the policy
model, leading to two key limitations: (1) the high-cost repetition of their
alignment algorithms for each new target model; (2) they cannot expand to
unseen objectives due to their static alignment objectives. In this work, we
propose Meta-Objective Aligner (MetaAligner), a model that performs conditional
weak-to-strong correction for weak responses to approach strong responses.
MetaAligner is the first policy-agnostic and generalizable method for
multi-objective preference alignment, which enables plug-and-play alignment by
decoupling parameter updates from the policy models and facilitates zero-shot
preference alignment for unseen objectives via in-context learning.
Experimental results show that MetaAligner achieves significant and balanced
improvements in multi-objective alignments on 11 policy models with up to 63x
more parameters, and outperforms previous alignment methods while using up to
22.27x fewer computational resources. The model also accurately aligns with unseen
objectives, marking the first step towards generalizable multi-objective
preference alignment.

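Recent progress in large language models focuses on addressing heterogeneous human expectations and values through multi-objective preference alignment. However, existing methods are parameter-adherent to the policy model, which leads to two main limitations: (1) their alignment algorithms must be repeated at high cost for each new target model; (2) they cannot extend to unseen objectives because their alignment objectives are static. In this work, we propose the Meta-Objective Aligner (MetaAligner), a model that performs conditional weak-to-strong correction, moving weak responses toward strong responses. MetaAligner is the first policy-agnostic and generalizable method for multi-objective preference alignment: it enables plug-and-play alignment by decoupling parameter updates from the policy models, and it achieves zero-shot preference alignment for unseen objectives via in-context learning. Experimental results show that MetaAligner achieves significant and balanced improvements in multi-objective alignment on 11 policy models with up to 63x more parameters, while requiring up to 22.27x fewer computational resources than previous alignment methods. The model also aligns accurately with unseen objectives, marking a first step towards generalizable multi-objective preference alignment.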

MetaAligner: Conditional Weak-to-Strong Correction for Generalizable Multi-Objective Language Model Alignment

MetaAligner: Conditional Weak-to-Strong Correction for Generalizable Multi-Objective Alignment of Language Models

Multi-agent robotic systems are increasingly operating in real-world
environments in close proximity to humans, yet are largely controlled by policy
models with inscrutable deep neural network representations. We introduce a
method for incorporating interpretable concepts from a domain expert into
models trained through multi-agent reinforcement learning, by requiring the
model to first predict such concepts and then utilize them for decision making.
This allows an expert both to reason about the resulting concept policy models
in terms of these high-level concepts at run-time and to intervene and
correct mispredictions to improve performance. We show that this yields
improved interpretability and training stability, with benefits to policy
performance and sample efficiency in a simulated and real-world
cooperative-competitive multi-agent game.
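
The predict-then-decide structure described above, together with expert intervention at run-time, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class `ConceptPolicy`, the concept names, and the hand-written predictor and action head are all hypothetical stand-ins for what would be learned neural network components.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Concepts = Dict[str, float]


class ConceptPolicy:
    """Policy that predicts named concepts first, then acts on them only."""

    def __init__(
        self,
        concept_predictor: Callable[[Sequence[float]], Concepts],
        action_head: Callable[[Concepts], str],
    ) -> None:
        self.concept_predictor = concept_predictor
        self.action_head = action_head

    def act(
        self,
        observation: Sequence[float],
        interventions: Optional[Concepts] = None,
    ) -> Tuple[str, Concepts]:
        # Step 1: map the raw observation to interpretable concepts.
        concepts = dict(self.concept_predictor(observation))
        # Step 2: an expert may overwrite mispredicted concepts.
        if interventions:
            unknown = set(interventions) - set(concepts)
            if unknown:
                raise KeyError(f"unknown concepts: {sorted(unknown)}")
            concepts.update(interventions)
        # Step 3: the action depends on the concepts, not the raw input.
        return self.action_head(concepts), concepts


# Hypothetical toy components: observation = [distance_to_ball, distance_to_opponent]
def toy_predictor(obs: Sequence[float]) -> Concepts:
    return {
        "has_ball": 1.0 if obs[0] < 0.5 else 0.0,
        "opponent_near": 1.0 if obs[1] < 0.5 else 0.0,
    }


def toy_action_head(concepts: Concepts) -> str:
    return "shoot" if concepts["has_ball"] >= 0.5 else "chase"


policy = ConceptPolicy(toy_predictor, toy_action_head)
action, predicted = policy.act([0.9, 0.2])
corrected_action, _ = policy.act([0.9, 0.2], interventions={"has_ball": 1.0})
```

Because the action head sees only the concept values, overriding `has_ball` changes the chosen action from `"chase"` to `"shoot"` without touching the predictor, which is the kind of run-time correction the abstract describes.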

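This paper introduces a method for incorporating interpretable concepts from domain experts into multi-agent reinforcement learning models, improving interpretability and training stability as well as policy performance and sample efficiency.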