TL;DR: Improves the performance of direct preference optimization (DPO) for training large language models (LLMs) by dynamically updating the beta value and optimizing data quality.
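To make the two ideas in the TL;DR concrete, here is a minimal sketch of the standard DPO loss on a single preference pair, with a hypothetical dynamic-beta rule. The `dynamic_beta` heuristic (shrinking beta when the reward margin is small, i.e. when the preference label is likely noisy) is an illustrative assumption, not the paper's exact method.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * (policy margin - reference margin))."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def dynamic_beta(base_beta, margin, threshold=0.1):
    """Hypothetical per-example beta update (assumption): use a smaller
    beta when the reward margin is below a threshold, so noisy pairs
    exert a weaker preference-optimization pressure."""
    return base_beta if abs(margin) >= threshold else 0.5 * base_beta

# Usage: a pair where the policy already prefers the chosen response.
margin = (-1.0 - (-1.5)) - (-2.0 - (-1.5))   # = 1.0
beta = dynamic_beta(0.1, margin)              # margin is large, keep beta
loss = dpo_loss(-1.0, -2.0, -1.5, -1.5, beta)
```

A larger beta amplifies the policy-versus-reference margin, so for a correctly ordered pair the loss falls as beta grows; scaling beta per example is one simple way to downweight low-quality preference data.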
Abstract
Direct preference optimization (DPO) has emerged as a compelling approach for training large language models (LLMs) to adhere to human preferences. However, the performance of DPO is sensitive to the fine-tuning