We explore testing the reasoning ability of large language models (LLMs),
such as ChatGPT, by engaging with them in a debate-like conversation that
probes deeper into their understanding of the subject. Specifically, we
formulate a new task in which, given a question, the LLM initially generates a
correct solution while the user initially believes in a wrong one, and the two
must discuss the problem and reach the correct decision through dialogue. Such a setting
requires the LLM not only to arrive at the correct answer on its own (which could
be done by shallow memorization), but also to defend the truth rather than
blindly believing, or being misled by, the user's (invalid) arguments and
critiques, thus testing in greater depth whether the LLM grasps the essence of
the reasoning required to solve the problem. To automate this evaluation
framework and save human labor, we simulate the user using another LLM
conditioned on a synthesized wrong solution. Across a range of complex
reasoning benchmarks spanning math, commonsense, logic and tasks from
BIG-Bench, we find that despite being able to generate correct step-by-step
solutions initially, ChatGPT fails to maintain its belief in the truth for a
significant portion of examples when challenged by often absurdly invalid
arguments. Our work reveals LLMs' weaknesses not captured by conventional
benchmarking, and also points to danger zones of aligning models with human
feedback.

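We explore testing the reasoning ability of large language models (LLMs) through debate-style dialogue, in order to gauge whether a model truly grasps the essence of a problem. Experiments on several complex reasoning benchmarks show that, although models such as ChatGPT can initially generate correct solutions, they fail to maintain their belief in the truth when confronted with absurdly invalid arguments.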
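
The debate-style evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `run_debate`, the callables `model`, `simulated_user`, and `extract_answer`, the prompt wording, and the `max_rounds` parameter are all hypothetical names introduced here, and any real LLM client would have to be wrapped to match these signatures.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]                        # (speaker, utterance)
Responder = Callable[[str, List[Turn]], str]  # (prompt, history) -> reply


def run_debate(
    question: str,
    correct_answer: str,
    wrong_solution: str,
    model: Responder,           # hypothetical wrapper around the LLM under test
    simulated_user: Responder,  # hypothetical wrapper around the user-simulating LLM
    extract_answer: Callable[[str], str],  # hypothetical answer parser
    max_rounds: int = 3,        # assumed round limit, not taken from the source
) -> Dict[str, object]:
    """Check whether the model keeps a correct answer under challenge."""
    history: List[Turn] = []

    # Step 1: the model answers the question on its own.
    initial = model(question, history)
    history.append(("model", initial))
    if extract_answer(initial) != correct_answer:
        # Only initially correct examples test the ability to defend the truth.
        return {"initially_correct": False, "defended": False, "history": history}

    # Step 2: the simulated user is conditioned on a synthesized wrong solution.
    user_prompt = f"{question}\nArgue for this solution: {wrong_solution}"

    final_answer = correct_answer
    for _ in range(max_rounds):
        challenge = simulated_user(user_prompt, history)
        history.append(("user", challenge))
        reply = model(question, history)
        history.append(("model", reply))
        final_answer = extract_answer(reply)

    # Step 3: the model "defends" if it still holds the correct answer at the end.
    return {
        "initially_correct": True,
        "defended": final_answer == correct_answer,
        "history": history,
    }
```

Passing the two models as plain callables keeps the sketch independent of any particular API; the fraction of initially correct examples for which `defended` is false would then correspond to the failure cases the abstract describes.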