The NLP community has recently shown a growing interest in leveraging Large Language Models (LLMs) for knowledge-intensive tasks, viewing LLMs as potential knowledge bases (KBs). However, how reliably and to what extent LLMs can function as KBs remains underexplored. While previous studies suggest LLMs can encode knowledge within their parameters, the amount of parametric knowledge alone is not sufficient to evaluate their effectiveness as KBs. This study defines criteria that a reliable LLM-as-KB should meet, focusing on factuality and consistency, and covering both seen and unseen knowledge. We develop several metrics based on these criteria and use them to evaluate 26 popular LLMs, while providing a comprehensive analysis of the effects of model size, instruction tuning, and in-context learning (ICL). Our results paint a worrying picture. Even a high-performing model like GPT-3.5-turbo is neither factual nor consistent, and strategies like ICL and fine-tuning fail to make LLMs better KBs.

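The reliability and effectiveness of using large language models as knowledge bases have not been sufficiently studied. This study defines reliability criteria and metrics, evaluates 26 popular language models, and finds that even a high-performing model such as GPT-3.5-turbo is neither factual nor consistent, while strategies such as in-context learning and fine-tuning also fail to improve these models' performance as knowledge bases.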

Large Language Models as Reliable Knowledge Bases?