Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.

Systematic Task Exploration with LLMs: A Study in Citation Text Generation