Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.

大型语言模型（LLMs）在知识密集型任务上表现出色，但如何系统评估LLMs的知识能力及其在不同领域和任务中的知识泛化能力仍然不为人所知。为此，我们提出了KGQuiz，这是一个基于知识的全面评估框架，包含了五个任务，从简单到复杂地涵盖了三个领域的知识。通过在KGQuiz基准测试中对十种开源和黑盒LLMs进行广泛实验，我们发现LLMs在简单的知识问答任务中表现出色，但在需要更复杂推理或领域特定事实的设置和上下文中仍然存在挑战。我们将KGQuiz视为一个测试平台，用于分析不同领域和任务格式下性能的微妙变化，并最终理解、评估和改进LLMs在广泛知识领域和任务中的知识能力。

KGQUIZ：评估大型语言模型中编码知识的泛化能力