Motivated by the rapid ascent of Large Language Models (LLMs) and debates
about the extent to which they possess human-level qualities, we propose a
framework for testing whether any agent (be it a machine or a human)
understands a subject matter. In Turing-test fashion, the framework is based
solely on the agent's performance, and specifically on how well it answers
questions. Elements of the framework include circumscribing the set of
questions (the "scope of understanding"), requiring general competence
("passing grade"), avoiding "ridiculous answers", but still allowing wrong and
"I don't know" answers to some questions. Reaching certainty about these
conditions requires exhaustive testing of the questions, which is impossible for
nontrivial scopes, but we show how high confidence can be achieved via random
sampling and the application of probabilistic confidence bounds. We also show
that accompanying answers with explanations can improve the sample complexity
required to achieve acceptable bounds, because an explanation of an answer
implies the ability to answer many similar questions. According to our
framework, current LLMs cannot be said to understand nontrivial domains. But
because the framework provides a practical recipe for testing understanding, it
also constitutes a tool for building AI agents that do understand.
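
To make the sampling argument concrete, here is a minimal sketch of how a confidence bound turns a random sample of questions into a claim about the whole scope. It uses Hoeffding's inequality, which is our assumption for illustration; the abstract does not say which bound the framework actually uses. It also assumes that questions are drawn independently and uniformly from the scope and that each answer is graded as simply correct or incorrect. The function names are hypothetical.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Number of sampled questions needed so that the observed score is
    within epsilon of the true score with probability at least 1 - delta
    (two-sided Hoeffding bound; an illustrative choice, not the paper's)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_lower_bound(correct: int, n: int, delta: float) -> float:
    """One-sided lower confidence bound on the true fraction of questions
    answered correctly, holding with probability at least 1 - delta."""
    p_hat = correct / n
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, p_hat - margin)


if __name__ == "__main__":
    # Questions needed for a +/- 5% estimate at 95% confidence.
    print(hoeffding_sample_size(epsilon=0.05, delta=0.05))
    # Lower bound on the true score after 900 correct answers out of 1000.
    print(round(hoeffding_lower_bound(correct=900, n=1000, delta=0.05), 4))
```

Under these assumptions the required sample size depends only on the desired margin and confidence level, not on the size of the scope, which is why sampling remains feasible where exhaustive testing is not.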

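Using random sampling and probabilistic confidence bounds, we propose a framework for testing whether any machine or human understands a subject. The framework includes circumscribing the scope of questions, requiring general competence, and avoiding ridiculous answers, while still allowing wrong and "I don't know" answers to some questions. According to our framework, current large language models cannot be said to understand nontrivial domains, but the framework provides a practical recipe for testing understanding and is also a tool for building AI agents that do understand.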