The interactive nature of Large Language Models (LLMs) theoretically allows
models to refine and improve their answers, yet systematic analysis of the
multi-turn behavior of LLMs remains limited. In this paper, we propose the
FlipFlop experiment: in the first round of the conversation, an LLM responds to
a prompt containing a classification task. In a second round, the LLM is
challenged with a follow-up phrase like "Are you sure?", offering an
opportunity for the model to reflect on its initial answer, and decide whether
to confirm or flip its answer. A systematic study of nine LLMs on seven
classification tasks reveals that models flip their answers on average 46% of
the time and that all models see a deterioration of accuracy between their
first and final prediction, with an average drop of 17%. The FlipFlop
experiment illustrates the universality of sycophantic behavior in LLMs and
provides a robust framework to analyze model behavior and evaluate potential
solutions.

通过 FlipFlop 实验，该研究探讨了大型语言模型的多轮互动行为，发现模型在回答问题时会反思并改进答案，提供了分析模型行为和评估潜在解决方案的可靠框架。

您确定吗？在 FlipFlop 实验中挑战 LLMs 导致性能下降

Are You Sure? Challenging LLMs Leads to Performance Drops in The  FlipFlop Experiment

Sycophancy is an undesirable behavior where models tailor their responses to
follow a human user's view even when that view is not objectively correct
(e.g., adapting liberal views once a user reveals that they are liberal). In
this paper, we study the prevalence of sycophancy in language models and
propose a simple synthetic-data intervention to reduce this behavior.
First, on a set of three sycophancy tasks (Perez et al., 2022) where models
are asked for an opinion on statements with no correct answers (e.g.,
politics), we observe that both model scaling and instruction tuning
significantly increase sycophancy for PaLM models up to 540B parameters.
Second, we extend sycophancy evaluations to simple addition statements that are
objectively incorrect, finding that despite knowing that these statements are
wrong, language models will still agree with them if the user does as well.
To reduce sycophancy, we present a straightforward synthetic-data
intervention that takes public NLP tasks and encourages models to be robust to
user opinions on these tasks. Adding these data in a lightweight finetuning
step can significantly reduce sycophantic behavior on held-out prompts. Code
for generating synthetic data for intervention can be found at
this https URL

本研究探讨了语言模型中阿谀奉承行为的普遍性，并提出了一种简单的合成数据干预来降低这种行为，通过在轻量级微调步骤中添加公共自然语言处理任务的合成数据，可以显著减少对用户意见的阿谀奉承行为。