Retriever-augmented instruction-following models are attractive alternatives
to fine-tuned approaches for information-seeking tasks such as question
answering (QA). Because retrieved documents are simply prepended to the input
along with an instruction, these models can be adapted to various information
domains and tasks without additional fine-tuning. While the model responses tend to be
natural and fluent, the additional verbosity makes traditional QA evaluation
metrics such as exact match (EM) and F1 unreliable for accurately quantifying
model performance.
In this work, we investigate the performance of instruction-following models
across three information-seeking QA tasks. We use both automatic and human
evaluation to assess these models along two dimensions: 1) how well they
satisfy the user's information need (correctness), and 2) whether they produce
a response based on the provided knowledge (faithfulness). Guided by human
evaluation and analysis, we highlight the shortcomings of traditional metrics
for both correctness and faithfulness. We then propose simple token-overlap-based
and model-based metrics that reflect the true performance of these
models. Our analysis reveals that instruction-following models are competitive with,
and sometimes even outperform, fine-tuned models on correctness. However, these
models struggle to stick to the provided knowledge and often hallucinate in
their responses. We hope our work encourages a more holistic evaluation of
instruction-following models for QA. Our code and data are available at
this https URL
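
To make the abstract's point about verbosity concrete, the following minimal sketch computes exact match and token-level F1 for a short and a verbose response to the same question. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles), which the abstract does not specify. Token recall is included only as one plausible token-overlap alternative; the abstract does not name the metric the authors propose, so treat `token_recall` and the other function names as illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the source does not specify one.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def _overlap(prediction, gold):
    """Return (shared token count, prediction length, gold length)."""
    pred = Counter(normalize(prediction))
    ref = Counter(normalize(gold))
    return sum((pred & ref).values()), sum(pred.values()), sum(ref.values())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall; penalizes extra tokens."""
    common, n_pred, n_gold = _overlap(prediction, gold)
    if common == 0:
        return 0.0
    precision = common / n_pred
    recall = common / n_gold
    return 2 * precision * recall / (precision + recall)


def token_recall(prediction, gold):
    """Fraction of gold tokens found in the prediction; ignores extra tokens."""
    common, _, n_gold = _overlap(prediction, gold)
    return common / n_gold if n_gold else 0.0


if __name__ == "__main__":
    gold = "Paris"
    for response in ["Paris", "The capital of France is Paris."]:
        print(
            repr(response),
            "EM:", exact_match(response, gold),
            "F1:", round(token_f1(response, gold), 3),
            "Recall:", token_recall(response, gold),
        )
```

Under these assumptions, the verbose response contains the correct answer yet scores 0 on EM and about 0.33 on F1, while recall stays at 1.0. Recall alone cannot penalize a response that also contains incorrect content, which is one reason an overlap metric may need to be paired with model-based or human judgments.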

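This study examines the performance of retriever-augmented instruction-following models on information-seeking question answering tasks, analyzes the shortcomings of traditional metrics, and proposes simple token-overlap-based and model-based metrics that reflect the true performance of these models. It finds that instruction-following models are competitive on correctness, and sometimes even outperform fine-tuned models, but struggle to stay faithful to the provided knowledge and often hallucinate in their responses.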

Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

The advent of ChatGPT, a large language model-powered chatbot, has prompted
questions about its potential implications for traditional search engines. In
this study, we investigate the differences in user behavior when employing
search engines and chatbot tools for information-seeking tasks. We carry out a
randomized online experiment, dividing participants into two groups: one using
a ChatGPT-like tool and the other using a Google Search-like tool. Our findings
reveal that the ChatGPT group consistently spends less time on all tasks, with
no significant difference in overall task performance between the groups.
Notably, ChatGPT levels user search performance across different education
levels and excels in answering straightforward questions and providing general
solutions but falls short in fact-checking tasks. Users perceive ChatGPT's
responses as having higher information quality than Google Search results,
yet display a similar level of trust in both tools. Furthermore,
participants using ChatGPT report significantly better user experiences in
terms of usefulness, enjoyment, and satisfaction, while perceived ease of use
remains comparable between the two tools. However, ChatGPT may also lead to
overreliance and generate or replicate misinformation, yielding inconsistent
results. Our study offers valuable insights for search engine management and
highlights opportunities for integrating chatbot technologies into search
engine designs.

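This study explores differences in user behavior when using search engines and chatbot tools for information-seeking tasks. It shows that the ChatGPT group spent less time on all tasks and reported a significantly better user experience; however, ChatGPT may also lead to overreliance and generate or replicate misinformation.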