In this paper, we introduce \emph{refined Direct Preference Optimization}
(rDPO), a method for improving the behavioral alignment of Large Language
Models (LLMs) without the need for human-annotated data. The method involves
creating synthetic data using self-critique prompting by a teacher LLM and then
utilising a generalized DPO loss function to distil to a student LLM. The loss
function incorporates an additional external reward model to improve the
quality of synthetic data, making rDPO robust to potential noise in the
synthetic dataset. rDPO is shown to be effective in a diverse set of
behavioral alignment tasks, such as improved safety, robustness against
role-playing, and reduced sycophancy. Code will be released.
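
As a point of reference for the loss described above, the following is a minimal sketch of the standard (unrefined) DPO loss for a single preference pair. It is not the rDPO loss itself: the abstract does not specify how the external reward model enters the generalized loss, so that term is left out here. The function names, argument names, and the default `beta` value are assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; beta scales the implicit reward
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy (student) or the frozen reference model.
    """
    # Implicit rewards are log-ratios between policy and reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-likelihood that the chosen response is preferred.
    return -log_sigmoid(margin)
```

When the policy equals the reference model the margin is zero and the loss equals log 2. The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one.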

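We propose a method called rDPO, which creates synthetic data through self-critique prompting and distils it into a student LLM using a generalized DPO loss function. An additional external reward model is used to improve the quality of the synthetic data, thereby improving the behavioral alignment of large language models.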

Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs

Numerous works have been proposed to improve or evaluate the capabilities of
large language models (LLMs) to fulfill user instructions. However, they neglect
the possibility that user inputs may inherently contain incorrect information
due to users' false beliefs or malicious intent. Blindly adhering to users'
false content can therefore cause deception and harm. To address this problem, we
propose a challenging benchmark consisting of Inductive Instructions (INDust)
to evaluate whether LLMs can resist these instructions. INDust includes
15K instructions across three categories: Fact-Checking Instructions, Questions
based on False Premises, and Creative Instructions based on False Premises. Our
experiments on several strong LLMs reveal that current LLMs can be easily
deceived by INDust into generating misleading and malicious statements. Hence
we employ Self-Critique prompting to encourage LLMs to critique not only
themselves, as in previous works, but also the users, which yields remarkable
improvement in handling inductive instructions under both zero-shot and
few-shot settings.
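
The Self-Critique prompting idea above can be illustrated with a small zero-shot sketch. The prompt wording, the function names, and the `generate` callable (a stand-in for any LLM call) are all hypothetical. The abstract does not give the authors' actual prompt, so this only shows the general shape of asking a model to critique both the user's premise and its own answer.

```python
from typing import Callable


def build_self_critique_prompt(instruction: str) -> str:
    """Wrap a user instruction in a (hypothetical) self-critique preamble."""
    return (
        "Before answering, do two things.\n"
        "1. Critique the user: check whether the instruction relies on a "
        "false or harmful premise.\n"
        "2. Critique yourself: check whether a direct answer would be "
        "misleading or harmful.\n"
        "Then respond, correcting any false premise instead of following it.\n\n"
        f"Instruction: {instruction}"
    )


def answer_with_self_critique(
    instruction: str, generate: Callable[[str], str]
) -> str:
    """Send the wrapped instruction to a model.

    `generate` is an assumed interface: any function mapping a prompt
    string to a completion string.
    """
    return generate(build_self_critique_prompt(instruction))
```

A few-shot variant would prepend worked examples of such critiques to the same prompt. The choice and format of those examples is likewise not specified in the abstract.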

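This paper proposes a challenging benchmark named INDust (Inductive Instructions) to evaluate whether large language models (LLMs) can resist misleading instructions supplied by users, and proposes a method named Self-Critique prompting to keep LLMs from misleading users. Experiments show that the method effectively improves LLMs' handling of inductive instructions in both zero-shot and few-shot settings.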