Recent large language models often answer factual questions correctly. But users can't trust any given claim a model makes without fact-checking, because language models can hallucinate convincing nonsense. In this work we use reinforcement learning from human preferences (RLHP) to train "open-book" QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness. Supporting evidence is drawn from multiple documents found via a search engine, or from a single user-provided document. Our 280 billion parameter model, GopherCite, is able to produce answers with high-quality supporting evidence and abstain from answering when unsure. We measure the performance of GopherCite by conducting human evaluation of answers to questions in a subset of the NaturalQuestions and ELI5 datasets. The model's response is found to be high-quality 80\% of the time on this NaturalQuestions subset, and 67\% of the time on the ELI5 subset. Abstaining from the third of questions for which it is most unsure improves performance to 90\% and 80\% respectively, approaching human baselines. However, analysis on the adversarial TruthfulQA dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.

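Using reinforcement learning from human preferences, we train "open-book" QA models that generate answers and cite supporting evidence for their claims. The model can draw supporting evidence from multiple documents found via a search engine, or from a single user-provided document. In human evaluation on subsets of the NaturalQuestions and ELI5 datasets, the model's responses are high-quality 80\% and 67\% of the time respectively. However, not all claims supported by evidence are true.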

Teaching language models to support answers with verified quotes