In an ever-evolving world, the dynamic nature of knowledge poses a challenge for language models (LMs) trained on static data, whose encoded information becomes outdated. Real-world scenarios require models not only to acquire new knowledge but also to overwrite outdated information with updated information. To address this under-explored issue, we introduce EvolvingQA, a temporally evolving question-answering benchmark for training and evaluating LMs on an evolving Wikipedia database; its construction is automated by our pipeline, which uses large language models. Our benchmark incorporates question answering as a downstream task to emulate real-world applications. Through EvolvingQA, we find that existing continual learning baselines have difficulty updating and forgetting outdated knowledge. Our findings suggest that the models fail to learn updated knowledge because of small weight gradients. Furthermore, we show that the models struggle most when providing numerical or temporal answers to questions asking for updated knowledge. Our work aims to model the dynamic nature of real-world information, offering a robust measure of the evolution-adaptability of language models.

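To address the problem that language models must acquire new knowledge and update old knowledge as knowledge continually evolves, we introduce a novel benchmark, EvolvingQA, for training and evaluating language models on a continually evolving Wikipedia database; it incorporates question answering as a downstream task to emulate real-world applications. We find that existing continual learning baselines have difficulty updating and forgetting outdated knowledge, mainly because small weight gradients prevent the models from learning the updated knowledge. In addition, we find that models have considerable difficulty providing numerical or temporal answers to questions that ask about updated knowledge. Our work aims to model the dynamic nature of real-world information and provides a robust measure of the evolution-adaptability of language models.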

Carpe Diem: On the Evaluation of World Knowledge in Lifelong Language Models