While large language models (LLMs) already achieve strong performance on
standard generic summarization benchmarks, their performance on more complex
summarization task settings is less studied. Therefore, we benchmark LLMs on
instruction controllable text summarization, where the model input consists of
both a source article and a natural language requirement for the desired
summary characteristics. To this end, we curate an evaluation-only dataset for
this task setting and conduct human evaluation on 5 LLM-based summarization
systems. We then benchmark LLM-based automatic evaluation for this task with 4
different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods
in total. Our study reveals that instruction controllable text summarization
remains a challenging task for LLMs, since (1) all LLMs evaluated still make
factual and other types of errors in their summaries; (2) no LLM-based
evaluation method achieves strong alignment with human annotators when
judging the quality of candidate summaries; (3) different LLMs show large
performance gaps in summary generation and evaluation. We make our collected
benchmark, InstruSum, publicly available to facilitate future research in this
direction.

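Language models already achieve strong performance on standard summarization
benchmarks, but their performance on more complex summarization task settings
has rarely been studied. This study evaluates language models on instruction
controllable text summarization and conducts automatic evaluation with multiple
evaluation protocols and multiple language models. The results show that
instruction controllable text summarization remains a challenging task for
language models, which exhibit various types of errors and performance gaps. We
publicly release our evaluation benchmark, InstruSum, to facilitate future
related research.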