We propose WorldSense, a benchmark designed to assess the extent to which
LLMs are consistently able to sustain tacit world models, by testing how they
draw simple inferences from descriptions of simple arrangements of entities.
WorldSense is a synthetic benchmark with three problem types, each with its
own trivial control. It explicitly avoids bias by decorrelating the abstract
structure of problems from the vocabulary and expressions used, and by decorrelating
all problem subparts from the correct response. We run our benchmark on three
state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these
models make errors even with as few as three objects. Furthermore, they show
strong response biases, preferring certain responses irrespective of the
question. Errors persist even with chain-of-thought prompting and in-context
learning. Lastly, we show that while finetuning on similar problems does result
in substantial improvements, both within and out of distribution, the finetuned
models do not generalise beyond a constrained problem space.
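
To make the idea of a decorrelated synthetic problem concrete, the following is a minimal illustrative sketch, not the benchmark's actual generator. It assumes a single hypothetical problem type (a left-to-right line of three entities), a made-up name list, and made-up sentence templates. Entity names are sampled independently of the hidden arrangement, and the TRUE/FALSE label is balanced, so a model that always gives the same response should score near chance.

```python
import random

# Hypothetical vocabulary; sampled independently of the problem structure.
NAMES = ["Ann", "Bob", "Cleo", "Dan", "Eve", "Finn"]
REL = " is to the left of "  # hypothetical sentence template


def make_problem(rng, n=3):
    """Return (description, statement, label) for a line of n entities."""
    order = rng.sample(NAMES, n)  # hidden ground-truth arrangement
    # State only adjacent pairs, in shuffled order, so the query needs inference.
    pairs = [(order[i], order[i + 1]) for i in range(n - 1)]
    rng.shuffle(pairs)
    description = " ".join(f"{a}{REL}{b}." for a, b in pairs)
    # Query the two end entities; a coin flip keeps the label balanced.
    label = rng.random() < 0.5
    a, b = (order[0], order[-1]) if label else (order[-1], order[0])
    return description, f"{a}{REL}{b}.", label


def solve(description, statement):
    """Reference solver: transitive closure over the stated facts."""
    facts = {tuple(s.split(REL)) for s in description.rstrip(".").split(". ")}
    changed = True
    while changed:
        new = {(a, d) for a, b in facts for c, d in facts if b == c} - facts
        facts |= new
        changed = bool(new)
    return tuple(statement.rstrip(".").split(REL)) in facts


def accuracy(answer_fn, problems):
    """Fraction of problems on which answer_fn returns the correct label."""
    return sum(answer_fn(d, s) == y for d, s, y in problems) / len(problems)


rng = random.Random(0)
problems = [make_problem(rng) for _ in range(1000)]


def always_true(description, statement):
    """Maximally biased responder: ignores the question entirely."""
    return True


print("reference solver:", accuracy(solve, problems))
print("always-TRUE baseline:", accuracy(always_true, problems))
```

In this sketch the reference solver answers every problem correctly, while the always-TRUE responder lands close to 50%, which is the kind of gap a balanced design is meant to expose.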
