High-quality preference datasets are essential for training reward models
that can effectively guide large language models (LLMs) in generating
high-quality responses aligned with human preferences. As LLMs become stronger
and better aligned, permissively licensed preference datasets, such as Open
Assistant, HH-RLHF, and HelpSteer, need to be updated to remain effective for
reward modeling. Methods that distil preference data from proprietary LLMs such
as GPT-4 have restrictions on commercial usage imposed by model providers. To
improve upon both generated responses and attribute labeling quality, we
release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0).
Using a powerful internal base model trained on HelpSteer2, we are able to
achieve the SOTA score (92.0%) on Reward-Bench's primary dataset, outperforming
currently listed open and proprietary models, as of June 12th, 2024. Notably,
HelpSteer2 consists of only ten thousand response pairs, an order of magnitude
fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly
efficient for training reward models. Our extensive experiments demonstrate
that reward models trained with HelpSteer2 are effective in aligning LLMs. In
particular, we propose SteerLM 2.0, a model alignment approach that can
effectively make use of the rich multi-attribute score predicted by our reward
models. HelpSteer2 and the accompanying code are publicly available.
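
As a minimal sketch of how a multi-attribute score might be collapsed into a single scalar for ranking two responses: the attribute names, weights, scores, and function names below are illustrative assumptions, not values or APIs taken from HelpSteer2 or SteerLM 2.0.

```python
from typing import Dict

# Hypothetical attribute weights, chosen only for illustration.
# A negative weight penalizes the attribute (here, verbosity).
ATTRIBUTE_WEIGHTS: Dict[str, float] = {
    "helpfulness": 1.0,
    "correctness": 1.0,
    "coherence": 0.5,
    "verbosity": -0.25,
}


def scalar_reward(scores: Dict[str, float]) -> float:
    """Collapse per-attribute scores into one number via a weighted sum."""
    return sum(weight * scores[name] for name, weight in ATTRIBUTE_WEIGHTS.items())


def preferred(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> str:
    """Return "a" or "b", whichever response has the higher scalar reward."""
    return "a" if scalar_reward(scores_a) >= scalar_reward(scores_b) else "b"


# Made-up attribute scores for two candidate responses to the same prompt.
response_a = {"helpfulness": 4.0, "correctness": 4.0, "coherence": 4.0, "verbosity": 2.0}
response_b = {"helpfulness": 3.0, "correctness": 2.0, "coherence": 4.0, "verbosity": 4.0}

print(preferred(response_a, response_b))
```

A weighted sum is only one possible aggregation; it is used here because it keeps each attribute's contribution easy to inspect.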

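A reward model trained on HelpSteer2 achieves a state-of-the-art score of 92.0% on Reward-Bench's primary dataset. We also propose SteerLM 2.0, a model alignment method that makes effective use of the multi-attribute scores predicted by our reward models.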

HelpSteer2: Open-source dataset for training top-performing reward models

Preference datasets are essential for incorporating human preferences into
pre-trained language models, playing a key role in the success of Reinforcement
Learning from Human Feedback. However, these datasets often demonstrate
conflicting alignment objectives, leading to increased vulnerability to
jailbreak attacks and challenges in adapting downstream tasks to prioritize
specific alignment objectives without negatively impacting others. In this
work, we introduce a novel statistical metric, Alignment Dimension Conflict, to
quantify the degree of conflict within preference datasets. We then present
Hummer and its fine-grained variant, Hummer-F, as innovative
pairwise preference datasets with reduced-conflict alignment objectives.
Hummer is built on UltraFeedback and enhanced by AI feedback
from GPT-4, making it the first preference dataset aimed at reducing the
competition between alignment objectives. Furthermore, we develop reward
models, HummerRM and HummerRM-F, which employ a hybrid sampling approach to
balance diverse alignment objectives effectively. This sampling method
positions HummerRM as a strong candidate for further domain-specific
fine-tuning and for reducing vulnerability to attacks.

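This work introduces a new statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. It presents Hummer and Hummer-F, two pairwise preference datasets, and develops the reward models HummerRM and HummerRM-F, which balance diverse alignment objectives and are suited to further domain-specific fine-tuning and to reducing vulnerability to attacks.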

Hummer: Towards Limited Competitive Preference Dataset

Aligning large language models (LLMs) with human intentions has become a
critical task for safely deploying models in real-world systems. While existing
alignment approaches have seen empirical success, theoretically understanding
how these methods affect model behavior remains an open question. Our work
provides an initial attempt to theoretically analyze the learning dynamics of
human preference alignment. We formally show how the distribution of preference
datasets influences the rate of model updates and provide rigorous guarantees
on the training accuracy. Our theory also reveals an intricate phenomenon where
the optimization is prone to prioritizing certain behaviors with higher
preference distinguishability. We empirically validate our findings on
contemporary LLMs and alignment tasks, reinforcing our theoretical insights and
shedding light on considerations for future alignment approaches. Disclaimer:
This paper contains potentially offensive text; reader discretion is advised.
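
The link between preference distinguishability and update rate can be illustrated with a toy calculation. The sketch below assumes a one-parameter linear reward and a Bradley-Terry style logistic loss, which is a common formalization of pairwise preference learning and not necessarily the setting analyzed in the paper; the function names and the numeric gaps are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def bt_gradient(w: float, diffs: List[float]) -> float:
    """Gradient with respect to w of the mean loss -log(sigmoid(w * d)).

    Each d is the feature gap between the preferred and the rejected
    response, so a larger d means the pair is easier to tell apart.
    """
    return sum(-(1.0 - sigmoid(w * d)) * d for d in diffs) / len(diffs)


# Two made-up behaviors: one with a large preferred-vs-rejected gap,
# one with a small gap.
well_separated = [2.0, 2.0]
weakly_separated = [0.2, 0.2]

# At initialization (w = 0) the gradient magnitude is half the mean gap,
# so the well-separated behavior receives the larger first update.
print(abs(bt_gradient(0.0, well_separated)))
print(abs(bt_gradient(0.0, weakly_separated)))
```

In this toy setting the first gradient step scales with the mean gap, which is one simple way to see why more distinguishable preferences could be learned first.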

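By theoretically analyzing the learning dynamics, we offer a theoretical perspective on human preference alignment, showing that optimization tends to prioritize behaviors with higher preference distinguishability. Empirical validation on contemporary LLMs and alignment tasks supports these findings and informs future alignment approaches.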