Theory of mind (ToM) evaluations currently focus on testing models using
passive narratives that inherently lack interactivity. We introduce FANToM, a
new benchmark designed to stress-test ToM within information-asymmetric
conversational contexts via question answering. Our benchmark draws upon
important theoretical requisites from psychology and necessary empirical
considerations when evaluating large language models (LLMs). In particular, we
formulate multiple types of questions that demand the same underlying reasoning
to identify an illusory or false sense of ToM capabilities in LLMs. We show that
FANToM is challenging for state-of-the-art LLMs, which perform significantly
worse than humans even with chain-of-thought reasoning or fine-tuning.

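FANToM is a benchmark designed to stress-test theory of mind in information-asymmetric conversational contexts via question answering. Drawing on important theoretical requisites from psychology and necessary empirical considerations for evaluating large language models, we formulate multiple types of questions to identify an illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.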