In recent years, large language models have advanced substantially,
achieving remarkable performance across diverse
tasks. To evaluate the knowledge capabilities of language models, previous
studies have proposed many benchmarks based on question-answer pairs. We argue
that evaluating language models with a fixed question or a limited set of
paraphrases as the query is neither reliable nor comprehensive, since language
models are sensitive to prompts. Therefore, we introduce a novel concept named knowledge
boundary to encompass both prompt-agnostic and prompt-sensitive knowledge
within language models. The knowledge boundary avoids prompt sensitivity in
language model evaluations, rendering them more dependable and robust. To
explore the knowledge boundary of a given model, we propose a projected gradient
descent method with semantic constraints, a new algorithm designed to identify
the optimal prompt for each piece of knowledge. Experiments demonstrate that
our algorithm outperforms existing methods in computing the knowledge boundary.
Furthermore, we use the knowledge boundary to evaluate the abilities of multiple
language models in several domains.
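
As a rough illustration of the general technique named above, the following
minimal sketch runs projected gradient descent on a toy problem. It is not the
algorithm from this work. It assumes the prompt is a small continuous vector,
that the semantic constraint can be approximated by an L2 ball around the
original prompt embedding, and that the loss is a simple quadratic. The names
project_onto_ball, projected_gradient_descent, toy_grad, and target are all
hypothetical.

```python
import math


def project_onto_ball(x, center, radius):
    """Project x onto the L2 ball of the given radius around center."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [ci + d * scale for ci, d in zip(center, diff)]


def projected_gradient_descent(grad_fn, start, center, radius,
                               step_size=0.1, num_steps=200):
    """Alternate a gradient step with a projection back onto the feasible set."""
    x = project_onto_ball(start, center, radius)
    for _ in range(num_steps):
        grad = grad_fn(x)
        x = [xi - step_size * gi for xi, gi in zip(x, grad)]
        x = project_onto_ball(x, center, radius)
    return x


# Toy stand-in for a model loss (assumption): squared distance to a
# hypothetical "ideal" prompt embedding that lies outside the feasible ball.
target = [3.0, 4.0]


def toy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


original_prompt = [0.0, 0.0]  # hypothetical embedding of the original question
best_prompt = projected_gradient_descent(
    toy_grad, original_prompt, original_prompt, radius=1.0
)
```

In this toy setting the search ends on the edge of the ball, at the feasible
point closest to the target. The projection step is what keeps the optimized
prompt close to the original one, which is the role a semantic constraint
would play in a real setting.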
