Among the remarkable emergent capabilities of large language models (LMs) is
free-text rationalization: beyond a certain scale, large LMs can generate
seemingly useful rationalizations, which in turn can dramatically improve
their performance on leaderboards. This phenomenon raises a question:
can machine-generated rationales also be useful for humans, especially when lay
humans try to answer questions based on those machine rationales? We observe
that the human utility of existing rationales is far from satisfactory and is
expensive to estimate with human studies. Existing metrics, such as the task
performance of the LM generating the rationales or the similarity between
generated and gold rationales, are not good indicators of their human utility.
While we observe that certain properties of rationales, such as conciseness
and novelty, are correlated with their human utility, estimating them without
human involvement is challenging. We show that, by estimating a rationale's
helpfulness in answering similar unseen instances, we can better measure its
human utility. We also translate this finding into an automated score, GEN-U,
which can help improve LMs' ability to generate rationales with better human
utility while maintaining most of their task performance. Lastly, we release
all code and collected data with this project.

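Large language models can generate seemingly useful rationales, but the human utility of those rationales is poor. We therefore propose GEN-U, an automated score that estimates a rationale's human utility without human involvement and helps improve it while preserving most task performance.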

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales

When Question-Answering (QA) systems are deployed in the real world, users
query them through a variety of interfaces, such as speaking to voice
assistants, typing questions into a search engine, or even translating
questions into languages supported by the QA system. While there has been
significant community attention devoted to identifying correct answers in
passages assuming a perfectly formed question, we show that components in the
pipeline that precede an answering engine can introduce varied and considerable
sources of error, and performance can degrade substantially because of these
upstream noise sources even for powerful pre-trained QA models. We conclude
that there is substantial room for progress before QA systems can be
effectively deployed, highlight the need for QA evaluation to expand to
consider real-world use, and hope that our findings will spur greater community
interest in the issues that arise when our systems actually need to be of
utility to humans.

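This paper studies the problems that arise when Question-Answering systems are deployed in practice. It finds that pipeline components preceding the answering engine can introduce varied and considerable errors, and that performance degrades substantially because of these upstream noise sources, even for powerful pre-trained QA models. The authors argue that substantial room for improvement remains before QA systems can be deployed effectively. They therefore stress that QA evaluation needs to expand to consider real-world use, and hope that their findings will attract broader attention.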