Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. However, RLHF is known to exploit biases in human preferences, such as verbosity: a well-formatted and eloquent answer is often rated more highly by users, even when it is less helpful and less objective. A number of approaches have been developed to control these biases in the classical RLHF literature, but the problem remains relatively under-explored for Direct Alignment Algorithms such as Direct Preference Optimization (DPO). Unlike classical RLHF, DPO does not train a separate reward model or use reinforcement learning directly, so approaches previously developed to control verbosity cannot be directly applied in this setting. Our work makes several contributions. For the first time, we study the length problem in the DPO setting, showing significant length exploitation in DPO and linking it to out-of-distribution bootstrapping. We then develop a principled but simple regularization strategy that prevents length exploitation while still maintaining improvements in model quality. We demonstrate these effects across summarization and dialogue datasets, where we achieve up to a 20% improvement in win rates when controlling for length, despite the GPT-4 judge's well-known verbosity bias.
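
To make the idea of length regularization in DPO concrete, the sketch below contrasts the standard per-pair DPO loss with one plausible length-regularized variant. The abstract does not state the exact form of the regularizer, so the additive length term, the function names, and the default values of `beta` and `alpha` are all illustrative assumptions rather than the method as published.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_* are sequence log-probabilities under the policy, ref_logp_* under
    the frozen reference model; w is the chosen and l the rejected response.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def length_regularized_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                                len_w, len_l, beta=0.1, alpha=0.01):
    """Hypothetical length-regularized DPO loss (an assumed form).

    Assumption: the implicit reward carries a per-token length term with
    weight alpha, which adds alpha * (len_w - len_l) to the preference margin.
    When the chosen response is longer, the margin grows and the pair
    contributes less gradient, so length alone is rewarded less.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    margin += alpha * (len_w - len_l)
    return -log_sigmoid(margin)
```

With `alpha=0` the second function reduces to the standard DPO loss. Under this assumed form, pairs whose chosen response is longer are down-weighted and pairs whose chosen response is shorter are up-weighted.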

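Reinforcement learning from human feedback has been crucial to the success of large language models, yet it suffers from problems such as a verbosity bias in human preferences. This work studies the length problem in Direct Preference Optimization (DPO) and proposes a simple yet principled regularization strategy to control verbosity. On summarization and dialogue datasets, we obtain up to a 20% improvement in win rates when controlling for length, despite the GPT-4 judge's verbosity bias.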

Disentangling Length from Quality in Direct Preference Optimization