The open-ended nature of language generation makes the evaluation of
autoregressive large language models (LLMs) challenging. One common evaluation
approach uses multiple-choice questions (MCQ) to limit the response space. The
model is then evaluated by ranking the candidate answers by the log probability
of the first token prediction. However, first tokens may not consistently
reflect the final response output, because models differ in response style,
for example starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is
not indicative of model behaviour when interacting with users. But by how much?
We evaluate how aligned first-token evaluation is with the text output along
several dimensions, namely final option choice, refusal rate, choice
distribution and robustness under prompt perturbation. Our results show that
the two approaches are severely misaligned on all dimensions, reaching mismatch
rates over 60%. Models heavily fine-tuned on conversational or safety data are
especially impacted. Crucially, models remain misaligned even when we
increasingly constrain prompts, i.e., force them to start with an option letter
or example template. Our findings i) underscore the importance of inspecting
the text output as well and ii) caution against relying solely on first-token
evaluation.
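
The comparison described above can be illustrated with a minimal sketch. It assumes the first-token log probabilities for each option letter have already been obtained from a model and that the generated text is available as a string. The function names, the four-letter option set, and the parsing patterns are all hypothetical simplifications for illustration, not the paper's actual pipeline.

```python
import math
import re

OPTIONS = ["A", "B", "C", "D"]  # assumed option letters

# Hypothetical, deliberately simple patterns for finding an option letter in text.
_PHRASE = re.compile(r"(?i:answer is|choose|option)\s*:?\s*\(?([A-D])\)?(?![A-Za-z])")
_LEADING = re.compile(r"^\s*\(?([A-D])\)?\s*(?:[.:]|$)")


def first_token_choice(first_token_logprobs):
    """Rank the options by first-token log probability and return the top one."""
    return max(OPTIONS, key=lambda o: first_token_logprobs.get(o, -math.inf))


def text_choice(response):
    """Return the option letter stated in the text, or None (e.g. a refusal)."""
    match = _LEADING.search(response) or _PHRASE.search(response)
    return match.group(1) if match else None


def mismatch_rate(examples):
    """Fraction of examples where the two evaluation approaches disagree.

    Each example is a (first_token_logprobs, response_text) pair. A response
    with no recoverable option letter counts as a mismatch.
    """
    mismatches = sum(
        first_token_choice(logprobs) != text_choice(response)
        for logprobs, response in examples
    )
    return mismatches / len(examples)
```

A response such as `Sure! My answer is C.` shows the problem: the first generated token is "Sure", so the option ranked highest by first-token probability need not be the "C" the model actually states.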

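For autoregressive large language models (LLMs), a common way to address the difficulty of evaluation is to use multiple-choice questions (MCQ) to limit the response space and to evaluate the model by ranking the candidate answers by the log probability of the first-token prediction. However, because models have diverse response styles, such as starting with "Sure" or refusing to answer, the first token may not consistently reflect the final response output. MCQ evaluation is therefore not indicative of model behaviour when interacting with users. We evaluate how well first-token evaluation agrees with the text output along several dimensions: final option choice, refusal rate, choice distribution, and robustness under prompt perturbation. The results show that the two approaches are severely misaligned on all dimensions, with mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially affected. Crucially, models remain misaligned even when prompts are increasingly constrained, for example by forcing responses to start with an option letter or an example template. Our findings underscore the importance of inspecting the text output and caution against relying solely on first-token evaluation.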

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

Advancements in large language models (LLMs) have demonstrated remarkable
capabilities across a diverse range of applications. These models excel in
generating text completions that are contextually coherent and cover an
extensive array of subjects. However, the vast datasets required for their
training make aligning response styles during the pretraining and instruction
tuning phases challenging. Consequently, an additional alignment phase is
typically employed, wherein the model is further trained with human preference
data to better align its outputs with human expectations. While this process
doesn't introduce new capabilities per se, it does accentuate generation styles
innate to the model. This paper explores the utilization of counterfactual
prompting within the framework of Direct Preference Optimization (DPO) to align
the model's style without relying on human intervention. We demonstrate that
this method effectively instils desirable behaviours, mitigates undesirable
ones, and encourages the model to disregard inappropriate instructions. Our
findings suggest that counterfactual prompting with DPO presents a low-resource
way to fine-tune LLMs to meet the demands for responsible and ethically aligned
AI systems.
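
As a rough illustration of the ingredients named above, the sketch below computes the standard per-example DPO loss from sequence log probabilities and builds one preference pair from a styled and an unstyled prompt. The pairing scheme is only one plausible reading of "counterfactual prompting" and is an assumption, as are the function names, the `generate` callable, and the default `beta`. The abstract does not specify these details.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log probability of a full response under the
    trainable policy or the frozen reference model.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    return _softplus(-margin)  # equals -log(sigmoid(margin))


def build_counterfactual_pair(instruction, generate, style_prompt):
    """Hypothetical pair builder (assumption, not the paper's exact recipe).

    The 'chosen' response is generated with a style prompt prepended, the
    'rejected' response without it, and the stored training prompt omits the
    style prompt so that the style is learned as the default behaviour.
    """
    chosen = generate(style_prompt + "\n" + instruction)
    rejected = generate(instruction)
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```

When the policy and the reference assign identical log probabilities, the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one.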

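This work explores aligning a model's style using counterfactual prompting within the Direct Preference Optimization framework. The method effectively instils desirable behaviour, mitigates undesirable behaviour, and encourages the model to disregard inappropriate instructions, offering a low-resource way for large language models to meet the demands for responsible and ethically aligned AI systems.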

Aligning Large Language Models with Counterfactual DPO

Considerable progress has been made towards conversational models that
generate coherent and fluent responses by training large language models on
large dialogue datasets. These models offer little or no control over the
generated responses and lack two important features: continual integration of
dialogue skills and seamless use of diverse knowledge sources. In this paper,
we propose the Adapter-Bot, a dialogue model that uses a fixed backbone
conversational model such as DialoGPT (Zhang et al., 2019) and triggers
on-demand dialogue skills (e.g., empathetic response, weather information, movie
recommendation) via different adapters (Houlsby et al., 2019). Each adapter can
be trained independently, thus allowing a continual integration of skills
without retraining the entire model. Depending on the skills, the model is able
to process multiple knowledge types, such as text, tables, and graphs, in a
seamless manner. The dialogue skills can be triggered automatically via a
dialogue manager, or manually, thus allowing high-level control of the
generated responses. At the current stage, we have implemented 12 response
styles (e.g., positive, negative), 8 goal-oriented skills (e.g., weather
information, movie recommendation), and personalized and empathetic
responses. We evaluate our model automatically by comparing it
with existing state-of-the-art conversational models, and we have released an
interactive system at adapter.bot.ust.hk.
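
The adapter idea can be sketched in a few lines of plain Python. The example below assumes a residual bottleneck adapter in the style of Houlsby et al. (2019): a down-projection, a nonlinearity, an up-projection, and a skip connection, applied on top of a frozen backbone layer. The class names, the single-layer backbone, and the explicit `skill` argument standing in for the dialogue manager are hypothetical simplifications, not the Adapter-Bot implementation.

```python
def _matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class Adapter:
    """Residual bottleneck adapter: h + W_up * relu(W_down * h)."""

    def __init__(self, down, up):
        self.down = down  # shape: bottleneck x hidden
        self.up = up      # shape: hidden x bottleneck

    def __call__(self, hidden):
        bottleneck = [max(0.0, z) for z in _matvec(self.down, hidden)]
        delta = _matvec(self.up, bottleneck)
        return [h + d for h, d in zip(hidden, delta)]


class SkillRouter:
    """Frozen backbone plus independently trained, swappable skill adapters."""

    def __init__(self, backbone_layer):
        self.backbone_layer = backbone_layer  # never retrained
        self.adapters = {}

    def add_skill(self, name, adapter):
        # A new skill only adds an adapter; the backbone is left untouched.
        self.adapters[name] = adapter

    def forward(self, inputs, skill=None):
        hidden = self.backbone_layer(inputs)
        if skill is None:  # plain backbone behaviour
            return hidden
        return self.adapters[skill](hidden)
```

Because each skill lives in its own adapter, registering a new one with `add_skill` leaves both the backbone and the existing skills unchanged, which is what makes continual integration possible without retraining the whole model.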

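This work proposes Adapter-Bot, a dialogue model that uses different adapters to trigger on-demand dialogue skills, enabling continual skill integration and seamless use of multiple knowledge sources. The model is evaluated automatically by comparison with existing state-of-the-art conversational models.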