As LLMs continue to evolve and improve, they have acquired impressive logical
reasoning, or vertical thinking, capabilities. But can they think outside the
box? Do they possess proficient lateral thinking abilities?
Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation
benchmark, LatEval, which assesses the model's lateral thinking within an
interactive framework. Our benchmark challenges LLMs on two aspects: the
quality of the questions posed by the model and the model's capability to
integrate information for problem-solving. We find that nearly all LLMs struggle with
employing lateral thinking during interactions. For example, even the most
advanced model, GPT-4, shows an advantage to some extent, yet a noticeable
gap remains when compared to humans. This evaluation benchmark
provides LLMs with a highly challenging and distinctive task that is crucial to
an effective AI assistant.

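Using the new evaluation benchmark LatEval, this study examines language models' capabilities in question quality and information integration. It finds that most models struggle to apply lateral thinking and presents a challenging task that is crucial for developing effective AI assistants.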