In the age of artificial intelligence, the role of large language models (LLMs) is becoming increasingly central. Despite their growing prevalence, their capacity to consolidate knowledge from different training documents, a crucial ability in numerous applications, remains unexplored. This paper presents the first study examining the capability of LLMs to effectively combine such information within their parameter space. We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. Evaluations across various LLMs reveal significant weaknesses in this domain. We contend that these shortcomings stem from the intrinsic nature of prevailing training objectives. Consequently, we advocate for refining the approach towards knowledge consolidation, as it has the potential to dramatically improve the overall effectiveness and performance of LLMs. The findings from this study offer insights for developing more robust and reliable LLMs. Our code and benchmark are available at https://github.com/chandar-lab/EpiK-Eval

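General-purpose large language models (LLMs) play an increasingly central role in the age of artificial intelligence. This paper examines the ability of LLMs to consolidate knowledge from different training documents, an ability that could improve their overall effectiveness and performance. By introducing a question-answering benchmark, the authors find that existing LLMs have significant weaknesses in this respect, and they call for better approaches to knowledge consolidation in order to develop more robust and reliable LLMs.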

EpiK-Eval: Evaluation for Language Models as Epistemic Models