In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus instead on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask what the performance of models would be if we assessed them (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production, because the results are strongly affected by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic machine learning models, and therefore make more robust class predictions.

Models of Reference Production: How Do They Withstand the Test of Time?