Direct Preference Optimization (DPO) has recently emerged as a popular
approach to improve reinforcement learning with human feedback (RLHF), leading
to better techniques to fine-tune large language models (LLMs). A weakness of
DPO, however, lies in its inability to characterize the diversity of
human preferences. Inspired by Mallows' theory of preference ranking, we
develop in this paper a new approach, Mallows-DPO. A distinct feature of
this approach is a dispersion index, which reflects how dispersed human
preferences are for a given prompt. We show that existing DPO models can be
reduced to special cases of this dispersion index, and are thus unified under
Mallows-DPO. More importantly, we demonstrate (empirically) how to use this
dispersion index to enhance the performance of DPO in a broad array of
benchmark tasks, from synthetic bandit selection to controllable generation
and dialogue, while maintaining strong generalization.
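
The abstract does not state the training objective, so the following is only a minimal sketch of how a per-prompt dispersion index could enter a DPO-style loss. It assumes the index acts as a multiplier `phi` on the usual DPO log-ratio margin, so that `phi = 1` gives back the standard DPO loss. The function name, the parameter names, and the placement of `phi` are illustrative assumptions, not the paper's stated formulation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dispersion_weighted_dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
    phi: float = 1.0,
) -> float:
    """DPO-style loss for one preference pair.

    `phi` is a hypothetical per-prompt dispersion weight (an assumption of
    this sketch); phi = 1.0 reduces to the standard DPO loss.
    """
    # Log-ratio margin between the preferred and the rejected response,
    # each measured relative to the frozen reference model.
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return -log_sigmoid(phi * beta * margin)
```

Under this assumption, a prompt with a larger `phi` has its margin scaled up and a prompt with a smaller `phi` has it scaled down, which acts like a prompt-specific `beta`.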

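Mallows-DPO is a new approach that uses a dispersion index of human preferences to improve Direct Preference Optimization (DPO), and with it the performance of reinforcement learning with human feedback. It applies to a range of benchmark tasks, such as synthetic bandit selection, controllable generation, and dialogue, while maintaining good generalization.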

Mallows-DPO: Fine-Tune Your LLM with Preference Dispersions

Developing an educational test can be expensive and time-consuming, as each
item must be written by experts and then evaluated by collecting hundreds of
student responses. Moreover, many tests require multiple distinct sets of
questions administered throughout the school year to closely monitor students'
progress, known as parallel tests. In this study, we focus on tests of silent
sentence reading efficiency, used to assess students' reading ability over
time. To generate high-quality parallel tests, we propose to fine-tune large
language models (LLMs) to simulate how previous students would have responded
to unseen items. With these simulated responses, we can estimate each item's
difficulty and ambiguity. We first use GPT-4 to generate new test items
following a list of expert-developed rules and then apply a fine-tuned LLM to
filter the items based on criteria from psychological measurements. We also
propose an optimal-transport-inspired technique for generating parallel tests
and show that the generated tests closely match the original test's
difficulty and reliability based on crowdworker responses. Our evaluation of a
generated test with 234 students from grades 2 to 8 produces test scores highly
correlated (r=0.93) with those of a standard test form written by human experts
and evaluated across thousands of K-12 students.
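
For readers unfamiliar with the reported statistic, here is a minimal sketch of how a correlation between scores on two test forms can be computed. It assumes the reported r is a Pearson correlation, which the abstract does not state explicitly. The function name is illustrative, and no data from the study is reproduced here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

Each position in `xs` and `ys` would hold one student's score on the generated form and on the standard form, respectively. A value near 1 means the two forms rank students almost identically.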

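Large language models are fine-tuned to simulate how previous students would have responded to unseen test items, and the simulated responses are used to generate high-quality parallel tests. In an evaluation with 234 students from grades 2 to 8, scores on a generated test correlate highly (r=0.93) with scores on a standard test form written by human experts and evaluated across thousands of K-12 students.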