Recently, Large Language Models (LLMs) have made remarkable progress in
language understanding and generation. In response, numerous benchmarks for
measuring a wide range of LLM capabilities have emerged. In this paper, we
challenge the reasoning and understanding abilities of LLMs by proposing a
FaLlacy Understanding Benchmark (FLUB) containing cunning questions that are
easy for humans to understand but difficult for models to grasp. Specifically,
the cunning questions in FLUB consist mainly of tricky, humorous, and
misleading questions collected from the real internet environment. We design
three tasks of increasing difficulty in FLUB to evaluate the fallacy
understanding ability of LLMs. Based on FLUB, we evaluate multiple
representative, advanced LLMs; the results show that FLUB is challenging and
worthy of further study. Our extensive experiments and detailed analyses yield
interesting findings and valuable insights. We hope that our benchmark can
encourage the community to
improve LLMs' ability to understand fallacies.

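In this paper, we challenge the reasoning and understanding abilities of large language models by proposing the FaLlacy Understanding Benchmark (FLUB), which contains cunning questions: tricky, humorous, and misleading questions collected from the real internet environment. We design three tasks of increasing difficulty to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we study the performance of multiple representative and advanced LLMs, which shows that FLUB is challenging and worthy of further research. Through extensive experiments and detailed analyses, we obtain interesting findings and valuable insights. We hope our benchmark can encourage the community to improve LLMs' ability to understand fallacies.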