In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning generation with human preferences. Direct Preference Optimization (DPO) trains the policy with a simple binary cross-entropy loss and requires no reward model. The DPO objective is regularized by the reverse KL divergence, which encourages mode-seeking fitting to the reference policy. Nonetheless, we show that minimizing the reverse KL divergence can fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose H-DPO, a simple modification to DPO that allows control over the entropy of the resulting policy, sharpening the distribution and thereby enabling more effective mode-seeking fitting. In our experiments, H-DPO outperformed DPO across various tasks, with superior results in pass@$k$ evaluations on mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to DPO's loss calculation, which makes it highly practical and promising for a wide range of applications in LLM training.
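
To make the "minor modification to the loss calculation" concrete, the following is a minimal sketch of a pairwise preference loss with an extra entropy coefficient. With `alpha = 1.0` it reduces to the standard DPO binary cross-entropy loss. The abstract does not give the H-DPO formula, so the way `alpha` enters here (rescaling the policy log-probabilities) is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    alpha: float = 1.0,
) -> float:
    """Binary cross-entropy preference loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and the
    reference model. alpha = 1.0 gives the standard DPO loss.
    alpha != 1.0 is an ASSUMED entropy-controlled variant in which alpha
    rescales the policy log-probabilities; the abstract does not state
    this formula.
    """
    chosen_margin = alpha * policy_logp_chosen - ref_logp_chosen
    rejected_margin = alpha * policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Standard DPO (alpha = 1.0) versus the assumed entropy-controlled variant.
dpo = preference_loss(-1.0, -2.0, -1.0, -2.0, beta=0.1, alpha=1.0)
variant = preference_loss(-1.0, -2.0, -1.0, -2.0, beta=0.1, alpha=0.9)
```

In a real training loop the log-probabilities would be batched tensors and the loss would be averaged over the batch; scalars are used here only to keep the sketch free of dependencies.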

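This work addresses a problem with Direct Preference Optimization (DPO), a method for reinforcement learning from human feedback in the post-training of large language models: minimizing the reverse KL divergence can fail to capture the modes of the reference distribution. We propose H-DPO, a simple modification that makes the entropy controllable, sharpening the distribution and thereby supporting mode-seeking fitting more effectively. Experiments show that H-DPO outperforms DPO across various tasks, with superior results on mathematical tasks, indicating its practical significance and potential for application in large language model training.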

Entropy-Controllable Direct Preference Optimization