In evaluating the long-context capabilities of large language models (LLMs), benchmarks such as "Needle-in-a-Haystack" (NIAH), Ruler, and Needlebench are commonly used. While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation--a critical aspect for applications such as design proposals and creative writing. To address this gap, we have introduced a new long-form text evaluation benchmark, LongGenBench, which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints and evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two different generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on the LongGenBench, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.

本研究解决了现有基准无法有效评估长篇文本生成质量的问题，提出了LongGenBench基准，专注于测试模型在生成长文时是否能准确包含特定事件。研究发现，尽管模型在传统长上下文基准上表现良好，但在LongGenBench上均未能达到令人满意的效果，尤其是在生成文本长度增加时性能明显下降。

LongGenBench：长上下文大语言模型的长篇生成基准测试