A common technique for aligning large language models (LLMs) relies on
acquiring human preferences by comparing multiple generations conditioned on a
fixed context. This protocol leverages pairwise comparisons only when the
generations are placed in an identical context. However, such conditional
rankings often fail to capture the complex and multidimensional aspects of
human preferences. In this work, we revisit the traditional paradigm of
preference acquisition and propose a new axis that is based on eliciting
preferences jointly over instruction-response pairs. While prior preference
optimization objectives (e.g., DPO) are designed for conditional ranking
protocols, our proposed preference acquisition protocol introduces DOVE, a new preference
optimization objective that upweights the joint probability of the chosen
instruction-response pair over the rejected instruction-response pair.
Interestingly, we find that the LLM trained with joint instruction-response
preference data using DOVE outperforms the LLM trained with DPO by 5.2% and
3.3% in win-rate on the summarization and open-ended dialogue datasets,
respectively. Our findings reveal that joint preferences over instruction and
response pairs can significantly enhance the alignment of LLMs by tapping into
a broader spectrum of human preference elicitation. The data and code are
available at this https URL

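Training large language models on joint instruction-response preference data, optimized with the DOVE objective, can significantly improve LLM alignment, raising win-rate by 5.2% and 3.3% on the summarization and open-ended dialogue datasets, respectively.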
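
The abstract describes DOVE only in words, as an objective that upweights the joint probability of the chosen instruction-response pair over the rejected one. The sketch below illustrates one way such an objective could look, assuming a DPO-style logistic loss applied to joint log-probabilities measured against a frozen reference model. The function names, the `beta` parameter, and the exact form of the loss are assumptions made for illustration and are not taken from the abstract; how the joint log-probability of a pair is computed is likewise left to the caller.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def joint_preference_loss(
    policy_chosen: float,
    ref_chosen: float,
    policy_rejected: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """Hypothetical DPO-style loss over joint (instruction, response) pairs.

    Each argument is assumed to be a joint log-probability log p(x, y) of one
    instruction-response pair, under the trained policy or the frozen
    reference model. The chosen and rejected pairs may use different
    instructions.
    """
    # Log-ratio of policy to reference for each pair.
    chosen_ratio = policy_chosen - ref_chosen
    rejected_ratio = policy_rejected - ref_rejected
    # The loss shrinks as the chosen pair gains probability relative to the
    # rejected pair, measured against the reference model.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy equals reference on both pairs: the margin is zero.
    print(joint_preference_loss(-10.0, -10.0, -12.0, -12.0))
    # Policy has raised the joint log-probability of the chosen pair.
    print(joint_preference_loss(-8.0, -10.0, -12.0, -12.0))
```

In this sketch, when both pairs share the same instruction, the instruction terms cancel inside the margin and the loss depends only on the conditional response log-probabilities, which matches the conditional comparison setting the abstract attributes to DPO.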