Long-form question answering (LFQA) enables answering a wide range of
questions, but its flexibility poses enormous challenges for evaluation. We
perform the first targeted study of the evaluation of long-form answers,
covering both human and automatic evaluation practices. We hire domain experts
in seven areas to provide preference judgments over pairs of answers, along
with free-form justifications for their choices. We present a careful analysis
of experts' evaluation, which focuses on new aspects such as the
comprehensiveness of the answer. Next, we examine automatic text generation
metrics, finding that no existing metrics are predictive of human preference
judgments. However, some metrics correlate with fine-grained aspects of answers
(e.g., coherence). We encourage future work to move away from a single "overall
score" of the answer and adopt a multi-faceted evaluation, targeting aspects
such as factuality and completeness. We publicly release all of our annotations
and code to spur future work on LFQA evaluation.

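Summary: A targeted study of the evaluation of long-form answers that emphasizes multi-faceted evaluation, finds that automatic text generation metrics do not predict human preferences, and recommends that future evaluation focus on multiple aspects such as accuracy, completeness, and objectivity.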
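
One way to make "predictive of human preference judgments" concrete is to score both answers in each pair with a metric and count how often the higher-scoring answer is the one the expert preferred. The sketch below is a minimal illustration under stated assumptions, not the procedure used in this work: the names `preference_accuracy` and `length_metric`, the `(answer_a, answer_b, choice)` data layout, and the half-credit rule for ties are all hypothetical, and the word-count scorer is only a toy stand-in for a real, possibly reference-based, metric.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout: (answer_a, answer_b, human_choice), with human_choice "a" or "b".
Pair = Tuple[str, str, str]


def preference_accuracy(pairs: Sequence[Pair], metric: Callable[[str], float]) -> float:
    """Fraction of pairs where the higher-scoring answer is the human-preferred one.

    Assumption: a tie in metric scores earns half credit, i.e. a coin flip.
    """
    if not pairs:
        raise ValueError("need at least one answer pair")
    agreement = 0.0
    for answer_a, answer_b, choice in pairs:
        score_a, score_b = metric(answer_a), metric(answer_b)
        if score_a == score_b:
            agreement += 0.5
        elif (score_a > score_b) == (choice == "a"):
            agreement += 1.0
    return agreement / len(pairs)


def length_metric(answer: str) -> float:
    """Toy stand-in scorer: whitespace-delimited word count."""
    return float(len(answer.split()))
```

Under this setup, an accuracy near 0.5 on a balanced set of pairs means the metric does no better than chance at recovering expert preferences. The same function can be reused per aspect (for example, with labels for which answer is more coherent) to check whether a metric tracks a fine-grained property even when it fails on overall preference.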