Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate the state-of-the-art LLMs with KGs in generic and specific domains. The experiment shows that ChatGPT is consistently the top performer across all domains. We also find that LLMs performance depends on the instruction finetuning, domain and question complexity and is prone to adversarial context.

通过利用知识图谱 (KGs) 来系统评估大型语言模型 (LLMs) 的事实知识，本文提出了一个框架。我们的框架通过从给定 KG 中存储的事实自动生成一组问题和预期答案，然后评估 LLMs 回答这些问题的准确性。我们在通用和特定领域系统评估了最先进的 LLMs，实验证明 ChatGPT 在所有领域中始终是最佳表现者。我们还发现 LLMs 的表现取决于指导微调、领域和问题的复杂性，并且容易受到对抗性环境的影响。

大型语言模型中的事实知识系统评估