Existing methods for controlling language models, such as RLHF and
Constitutional AI, involve determining which LLM behaviors are desirable and
training them into a language model. However, in many cases, it is desirable
for LLMs to be controllable \textit{at inference time}, so that they can be
used in multiple contexts with diverse needs. We illustrate this with the
\textbf{Pink Elephant Problem}: instructing an LLM to avoid discussing a
certain entity (a ``Pink Elephant''), and instead discuss a preferred entity
(a ``Grey Elephant''). We apply a novel simplification of Constitutional AI,
\textbf{Direct Principle Feedback} (DPF), which skips the ranking of responses and
uses Direct Preference Optimization (DPO) directly on critiques and revisions. Our results show that after DPF
fine-tuning on our synthetic Pink Elephants dataset, our 13B fine-tuned LLaMA 2
model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and
performs as well as GPT-4 on our curated test set assessing the Pink
Elephant Problem.

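Through a study of existing methods for controlling language models, such as RLHF and Constitutional AI, we find that in many cases it is desirable to control language models at inference time, so that diverse needs can be met in different contexts. We illustrate this with the ``Pink Elephant Problem'': instructing a language model to avoid discussing a certain entity (the ``Pink Elephant'') and instead discuss a preferred entity (the ``Grey Elephant''). We apply a novel simplification of Constitutional AI, ``Direct Principle Feedback'', which skips the ranking of responses and applies DPO directly to critiques and revisions. Our results show that after DPF fine-tuning on our synthetic Pink Elephants dataset, our fine-tuned 13B LLaMA 2 model significantly outperforms Llama-2-13B-Chat and a prompted baseline, and performs on par with GPT-4 on our curated test set assessing the Pink Elephant Problem.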