Recently, Large Language Models (LLMs) have attained impressive performance in
math and reasoning benchmarks. However, they still often struggle with logic
problems and puzzles that are relatively easy for humans. To further
investigate this, we introduce a new benchmark, SearchBench, containing 11
unique search problem types, each equipped with automated pipelines to generate
an arbitrary number of instances and analyze the feasibility, correctness, and
optimality of LLM-generated solutions. We show that even the most advanced LLMs
fail to solve these problems end-to-end in text; e.g., GPT-4 solves only 1.4%.
SearchBench problems require considering multiple pathways to the solution as
well as backtracking, posing a significant challenge to auto-regressive models.
Instructing LLMs to generate code that solves the problem helps, but only
slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that
in-context learning with A* algorithm implementations enhances performance. The
full potential of this prompting approach emerges when combined with our
proposed Multi-Stage-Multi-Try method, which breaks down the algorithm
implementation into two stages and verifies the first stage against unit tests,
raising GPT-4's performance above 57%.
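
For readers unfamiliar with the algorithm named above, the following is a minimal A* sketch in Python. It is not taken from SearchBench: the grid-navigation problem, the names `a_star` and `is_feasible`, and the Manhattan-distance heuristic are all assumptions chosen for illustration. The second function hints at how a candidate solution might be checked for feasibility, in the spirit of the automated checks the abstract mentions.

```python
import heapq


def a_star(grid, start, goal):
    """Cheapest 4-connected path on a grid of 0 (free) and 1 (wall).

    Returns (cost, path), or None if the goal is unreachable.
    Hypothetical toy problem, not one of the SearchBench problem types.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance never overestimates the remaining cost
        # when every move costs 1, so the first goal pop is optimal.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Priority queue of (f = g + h, g, cell, path so far).
    frontier = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, cell, path = heapq.heappop(frontier)
        if cell == goal:
            return g, path
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry; a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt, path + [nxt]))
    return None


def is_feasible(grid, path, start, goal):
    """Check that a path starts and ends correctly and only makes legal moves."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1 or grid[r2][c2] != 0:
            return False
    return True


# Made-up example instance.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
result = a_star(GRID, (0, 0), (3, 3))
```

Unlike a left-to-right text answer, the priority queue lets the search abandon a partial path and resume a cheaper one, which is the backtracking behaviour the abstract identifies as hard for auto-regressive generation.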

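Recently, large language models have achieved impressive performance on math and reasoning benchmarks. However, they still often struggle with logic problems and puzzles that are relatively easy for humans. To investigate this further, we introduce a new benchmark named SearchBench, which contains 11 unique search problem types, each equipped with automated pipelines that generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We find that even the most advanced LLMs cannot solve these problems entirely in text; for example, GPT-4 solves only 1.4% of them. SearchBench problems require considering multiple solution paths as well as backtracking, which poses a major challenge for auto-regressive models. Instructing LLMs to generate code that solves the problem helps, but only slightly; for example, GPT-4's performance rises to 11.7%. In this work, we show how in-context learning with A* algorithm implementations improves performance. The full potential of this approach emerges when it is combined with our proposed Multi-Stage-Multi-Try method, which raises GPT-4's performance above 57%.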

Navigating the Maze: Evaluating and Improving LLMs' Ability to Handle Search Problems

Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems

A comparison of three chatbots based on large language models, namely
ChatGPT-3.5, ChatGPT-4, and Google Bard, is presented, focusing on their
ability to give correct answers to mathematics and logic problems. In
particular, we check their ability to Understand the problem at hand; Apply
appropriate algorithms or methods for its solution; and Generate a coherent
response and a correct answer. We use 30 questions that are clear, without any
ambiguities, fully described with plain text only, and have a unique,
well-defined correct answer. The questions are divided into two sets of 15 each. The
questions of Set A are 15 "Original" problems that cannot be found online,
while Set B contains 15 "Published" problems that one can find online, usually
with their solution. Each question is posed three times to each chatbot. The
answers are recorded and discussed, highlighting their strengths and
weaknesses. It has been found that for straightforward arithmetic, algebraic
expressions, or basic logic puzzles, chatbots may provide accurate solutions,
although not in every attempt. However, for more complex mathematical problems
or advanced logic tasks, their answers, although usually written in a
"convincing" way, may not be reliable. Consistency is also an issue, as a
chatbot will often provide conflicting answers when given the same question
more than once. A comparative quantitative evaluation of the three chatbots is
made through scoring their final answers based on correctness. It was found
that ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions. Bard comes
third in the original questions of Set A, behind the other two chatbots, while
it has the best performance (first place) in the published questions of Set B.
This is probably because Bard has direct access to the internet, in contrast to
the ChatGPT chatbots, which do not have any communication with the outside world.
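
The repeated-query protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `score_chatbot`, the exact-match comparison, the two metrics, and the sample answers are all hypothetical, and the paper's actual scoring scheme may differ.

```python
def score_chatbot(answers, correct):
    """Score one chatbot over repeated attempts.

    answers: {question_id: [final answer for each attempt]}
    correct: {question_id: reference answer}
    Exact string match is an assumption made for this sketch.
    """
    total_attempts = sum(len(attempts) for attempts in answers.values())
    n_correct = sum(
        ans == correct[qid]
        for qid, attempts in answers.items()
        for ans in attempts
    )
    # A question counts as consistent if every attempt gave the same answer,
    # whether or not that answer is correct.
    n_consistent = sum(len(set(attempts)) == 1 for attempts in answers.values())
    return {
        "accuracy": n_correct / total_attempts,
        "consistency": n_consistent / len(answers),
    }


# Placeholder data, NOT results from the paper: two questions, three attempts each.
sample_answers = {"A1": ["12", "12", "12"], "A2": ["7", "9", "7"]}
sample_correct = {"A1": "12", "A2": "7"}
scores = score_chatbot(sample_answers, sample_correct)
```

Reporting consistency separately from accuracy keeps the two observations in the abstract apart: a chatbot can be right on some attempts and still contradict itself across attempts.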

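Three chatbots based on large language models (ChatGPT-3.5, ChatGPT-4, and Google Bard) are compared, focusing on their ability to solve mathematics and logic problems. A series of tests shows that for simple arithmetic, algebraic expressions, and basic logic puzzles, the chatbots may provide accurate solutions, but for more complex mathematical problems or advanced logic tasks, their answers may not be reliable. ChatGPT-4 outperforms ChatGPT-3.5 on both sets of questions, while Bard performs best on Set B.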