Large Language Models (LLMs) have demonstrated impressive capabilities in many
natural language tasks. However, the auto-regressive generation process makes
LLMs prone to producing errors, hallucinations, and inconsistent statements when
performing multi-step reasoning. In this paper, we aim to alleviate this
pathology by introducing Q*, a general, versatile, and agile framework for
guiding the LLM decoding process with deliberative planning. By learning a
plug-and-play Q-value model as a heuristic function, Q* can effectively guide
LLMs to select the most promising next step without fine-tuning them for each
task, which avoids significant computational overhead and the potential risk of
performance degeneration on other tasks. Extensive experiments on GSM8K, MATH,
and MBPP confirm the superiority of our method.

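By introducing the Q* framework, we can alleviate the errors, hallucinations, and inconsistent statements that large language models produce during multi-step reasoning. Q* is a general, versatile, and agile framework that learns a plug-and-play Q-value model as a heuristic function, effectively guiding large language models to select the most promising next step while avoiding the computational overhead and the potential risk of performance degeneration that come with fine-tuning a large language model for each task. Extensive experiments on the three tasks GSM8K, MATH, and MBPP demonstrate the superiority of our method.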
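
To make the idea of heuristic-guided step selection concrete, the following is a minimal sketch and not the paper's exact algorithm. It runs a plain best-first search in which a Q-value function ranks candidate next steps, while the model that proposes the steps is left untouched. All names here (`best_first_decode`, `propose_steps`, `q_value`, `is_terminal`) are hypothetical, and the toy proposer and scorer stand in for a frozen LLM and a learned Q-value model.

```python
import heapq
from typing import Callable, List


def best_first_decode(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # stand-in for a frozen LLM
    q_value: Callable[[str, List[str], str], float],       # stand-in for a learned Q-value model
    is_terminal: Callable[[List[str]], bool],
    max_expansions: int = 100,
) -> List[str]:
    """Expand the partial reasoning trace with the highest Q-value first."""
    counter = 0  # unique tie-breaker so the heap never compares step lists
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(0.0, counter, [])]
    while frontier and max_expansions > 0:
        _, _, steps = heapq.heappop(frontier)
        if is_terminal(steps):
            return steps
        max_expansions -= 1
        for step in propose_steps(question, steps):
            counter += 1
            score = q_value(question, steps, step)
            heapq.heappush(frontier, (-score, counter, steps + [step]))
    return []  # no complete trace found within the expansion budget


# Toy stand-ins (assumptions for illustration only).
def toy_propose(question: str, steps: List[str]) -> List[str]:
    return ["a", "b"] if len(steps) < 2 else []


def toy_q(question: str, steps: List[str], step: str) -> float:
    return 1.0 if step == "b" else 0.0


def toy_done(steps: List[str]) -> bool:
    return len(steps) == 2


result = best_first_decode("toy question", toy_propose, toy_q, toy_done)
```

Because the scorer is passed in as an ordinary function, it can be swapped per task without retraining the proposer, which mirrors the plug-and-play property described above. This sketch ranks candidates by the Q-value alone; any additional terms the full method uses are outside what the abstract specifies.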