A recent focus of large language model (LLM) development, as exemplified by
generative search engines, is to incorporate external references to generate
and support their claims. However, evaluating the attribution, i.e., verifying
whether the generated statement is indeed fully supported by the cited
reference, remains an open problem. Although human evaluation is common
practice, it is costly and time-consuming. In this paper, we investigate the
automatic evaluation of attribution by LLMs. We begin by providing a definition
of attribution and then explore two approaches for automatic evaluation:
prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed
from related tasks, such as question answering, fact-checking, natural language
inference, and summarization. To facilitate the evaluation, we manually curate
a set of test examples covering 12 domains from a generative search engine, New
Bing. Our results on the curated test set and simulated test examples from
existing benchmark questions highlight both promising signals and
remaining challenges for the automatic evaluation of attribution. We hope our
testbed, modeling methodology, and insights will help lay the foundation for
future studies on this important problem.
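
The prompting approach mentioned above can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the two-way label set, the prompt wording, and the helper names `build_attribution_prompt`, `parse_label`, and `evaluate_attribution` do not come from the paper, and the LLM is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable

# Hypothetical label set: the abstract only speaks of a statement being
# "fully supported" by its reference, so two labels are assumed here.
LABELS = ("supported", "not supported")


def build_attribution_prompt(claim: str, reference: str) -> str:
    """Format a claim and its cited reference into a verification prompt."""
    return (
        "Decide whether the reference fully supports the claim.\n"
        f"Claim: {claim}\n"
        f"Reference: {reference}\n"
        f"Answer with one of: {', '.join(LABELS)}."
    )


def parse_label(response: str) -> str:
    """Map a model response onto an assumed label, or 'unknown'."""
    text = response.strip().lower()
    # Only accept responses that begin with a label; anything else is
    # reported as 'unknown' rather than guessed.
    if text.startswith("not supported"):
        return "not supported"
    if text.startswith("supported"):
        return "supported"
    return "unknown"


def evaluate_attribution(
    claim: str, reference: str, llm: Callable[[str], str]
) -> str:
    """Run one claim/reference pair through a caller-supplied LLM function."""
    return parse_label(llm(build_attribution_prompt(claim, reference)))
```

Passing the model as a callable keeps the sketch independent of any provider; in practice `llm` would wrap whichever model is being tested as an evaluator.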

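This paper explores two approaches to the automatic evaluation of attribution by large language models: prompting LLMs and fine-tuning smaller LMs. We manually curate a set of test examples covering 12 domains and evaluate the automatic evaluators on it, aiming to lay a foundation for future research on this important problem.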

Automatic Evaluation of Attribution by Large Language Models

Auto-evaluation aims to automatically evaluate a trained model on any test
dataset without human annotations. Most existing methods utilize global
statistics of features extracted by the model as the representation of a
dataset. This ignores the influence of the classification head and loses
category-wise confusion information of the model. However, the ratios of instances
assigned to different categories, together with their confidence scores, reflect
how many instances in each category are difficult for the model to classify,
and thus carry significant indicators of both overall and category-wise
performance. In this paper, we propose a Confidence-based Category
Relation-aware Regression ($C^2R^2$) method. $C^2R^2$ divides all instances in
a meta-set into different categories according to their confidence scores and
extracts the global representation from them. For each category, $C^2R^2$
encodes its local confusion relations to other categories into a local
representation. The overall and category-wise performances are regressed from
global and local representations, respectively. Extensive experiments show the
effectiveness of our method.
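
To make the global and local representations concrete, here is a minimal sketch of one possible feature extraction from a model's softmax outputs on an unlabeled dataset. It is not the authors' implementation: grouping instances by predicted category, the specific features (assignment ratios, mean confidences, mean probability vectors), and the function name `category_representations` are all assumptions, and the regression step that maps these features to accuracy is omitted.

```python
from typing import List, Tuple


def category_representations(
    probs: List[List[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Summarize softmax outputs on one unlabeled dataset.

    probs[i][j] is the predicted probability of category j for instance i.
    Returns (global_rep, local_reps):
      global_rep: the share of instances assigned to each category,
                  followed by the mean confidence per category (length 2K).
      local_reps: for each category c, the mean probability vector of the
                  instances assigned to c (one row of a soft confusion matrix).
    Both choices of feature are assumptions made for this sketch.
    """
    n, k = len(probs), len(probs[0])
    counts = [0] * k
    conf_sums = [0.0] * k
    prob_sums = [[0.0] * k for _ in range(k)]

    for p in probs:
        pred = max(range(k), key=lambda j: p[j])  # predicted category
        counts[pred] += 1
        conf_sums[pred] += p[pred]  # confidence = top probability
        for j in range(k):
            prob_sums[pred][j] += p[j]

    ratios = [c / n for c in counts]
    # Categories with no assigned instances get zeros instead of a division error.
    mean_conf = [s / c if c else 0.0 for s, c in zip(conf_sums, counts)]
    local_reps = [
        [s / c if c else 0.0 for s in row] for row, c in zip(prob_sums, counts)
    ]
    return ratios + mean_conf, local_reps
```

Under these assumptions, a regressor for overall accuracy would be trained on the global vector, and a regressor for category-wise accuracy on each local row, using a meta-set of datasets whose true accuracies are known.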

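This paper proposes a confidence-based category relation-aware regression method, called $C^2R^2$, which uses local and global representations to relate the classification model to the test data, enabling automatic evaluation of a trained model.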

Toward Auto-evaluation with Confidence-based Category Relation-aware Regression

The difficulty of textual style transfer lies in the lack of parallel
corpora. Numerous advances have been proposed for unsupervised generation.
However, significant problems remain with the auto-evaluation of style transfer
tasks. Based on the summary of Pang and Gimpel (2018) and Mir et al. (2019),
style transfer evaluations rely on three criteria: style accuracy of
transferred sentences, content similarity between original and transferred
sentences, and fluency of transferred sentences. We elucidate the problematic
current state of style transfer research. Given that current tasks do not
represent real use cases of style transfer, the current auto-evaluation approach
is flawed. This discussion aims to prompt researchers to think about the future
of style transfer and style transfer evaluation research.

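This paper discusses a key problem in text style transfer: the difficulty of automatically evaluating style transfer tasks that rely on unsupervised generation. Drawing on summaries in prior work, we describe the problems in current style transfer research and argue that existing auto-evaluation methods are flawed and cannot accurately assess the style accuracy, content similarity, and fluency of transferred sentences. The paper aims to prompt researchers to think about the future of style transfer and its evaluation.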