Recently, Large Language Models (LLMs) have attained impressive performance in
math and reasoning benchmarks. However, they still often struggle with logic
problems and puzzles that are relatively easy for humans. To further
investigate this, we introduce a new benchmark, SearchBench, containing 11
unique search problem types, each equipped with automated pipelines to generate
an arbitrary number of instances and analyze the feasibility, correctness, and
optimality of LLM-generated solutions. We show that even the most advanced LLMs
fail to solve these problems end-to-end in text, e.g., GPT-4 solves only 1.4%.
SearchBench problems require considering multiple pathways to the solution as
well as backtracking, posing a significant challenge to auto-regressive models.
Instructing LLMs to generate code that solves the problem helps, but only
slightly, e.g., GPT-4's performance rises to 11.7%. In this work, we show that
in-context learning with A* algorithm implementations enhances performance. The
full potential of this prompting approach emerges when combined with our
proposed Multi-Stage-Multi-Try method, which breaks down the algorithm
implementation into two stages and verifies the first stage against unit tests,
raising GPT-4's performance above 57%.
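
For readers unfamiliar with A*, the kind of implementation used as an in-context example might look like the minimal sketch below. It is a generic illustration, not the prompt or code from this work: the function names (`a_star`, `grid_neighbors`, `manhattan`) and the toy grid are assumptions made for the example. The search keeps a priority queue of partial paths ordered by cost so far plus a heuristic estimate, which is what lets it consider several pathways and abandon unpromising ones.

```python
import heapq
import itertools

def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for a cheapest path from start to goal, or None."""
    counter = itertools.count()  # tie-breaker so states are never compared
    # Queue entries: (f = g + h, tie-breaker, g, state, path so far)
    frontier = [(heuristic(start), next(counter), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return g, path
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper route to this state was found
        for nxt, step_cost in neighbors(state):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                entry = (new_g + heuristic(nxt), next(counter), new_g, nxt, path + [nxt])
                heapq.heappush(frontier, entry)
    return None  # goal unreachable

# Hypothetical toy instance: '#' cells are walls, moves cost 1.
GRID = ["..#",
        "..#",
        "..."]
GOAL = (2, 2)

def grid_neighbors(state):
    row, col = state
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + dr, col + dc
        if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#":
            yield (r, c), 1

def manhattan(state):
    # Admissible heuristic for unit-cost 4-connected grid moves.
    return abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1])

result = a_star((0, 0), GOAL, grid_neighbors, manhattan)
```

Because the Manhattan distance never overestimates the remaining cost on this grid, the first time the goal is popped from the queue its path is optimal, which mirrors the feasibility, correctness, and optimality checks described above.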
