Learning from human preferences is a paradigm used in the fine-tuning stage of large language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks. Earlier approaches used reinforcement learning from human feedback (RLHF) to optimize the LLM policy toward these preferences while keeping it from drifting too far from the original model. More recently, Direct Preference Optimization (DPO) was proposed to solve the alignment problem with a simplified, RL-free method. Using preference pairs of chosen and rejected responses, DPO treats the relative log probability as an implicit reward function and optimizes the LLM policy directly with a simple binary cross-entropy objective. DPO is straightforward and easy to understand, and it performs efficiently and well in most cases. In this article, we analyze the working mechanism of $\beta$ in DPO, expose how its formulation differs between the RL algorithm and DPO, and examine the potential shortcomings introduced by DPO's simplification. With these insights, we propose MinorDPO, which is better aligned with the original RL algorithm and increases the stability of the preference optimization process.
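
To make the objective described above concrete, the following is a minimal sketch of the standard per-pair DPO loss, written with the Python standard library only. It is not the MinorDPO method, whose details the abstract does not give. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed sequence log probabilities of the chosen and rejected responses under the policy and under the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All names and the default beta are assumptions made for this example.
    """
    # Implicit rewards: beta-scaled log-probability ratio to the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Binary cross-entropy with the chosen response as the positive label:
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2), about 0.693.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))

# Lowering only the rejected log probability also reduces the loss.
print(dpo_loss(-1.0, -3.0, -1.0, -1.0))
```

The second call illustrates that the loss depends only on the reward margin, so it can fall when the rejected response is pushed down even though the chosen response is no more likely than before.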

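This work addresses the problem of aligning large language models with human preferences when training with the existing Direct Preference Optimization (DPO) method. By analyzing and improving the $\beta$ mechanism in DPO, it proposes MinorDPO, which is more stable during preference optimization and better aligned with the original reinforcement learning algorithm. Its key finding is that the method improves training robustness and thereby enhances model performance.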

Reducing the DPO rejection penalty to increase training robustness