Causality reveals fundamental principles behind data distributions in
real-world scenarios, and the capability of large language models (LLMs) to
understand causality directly impacts their efficacy in explaining outputs,
adapting to new evidence, and generating counterfactuals. With the
proliferation of LLMs, the evaluation of this capacity is attracting
increasing attention. However, the absence of a comprehensive benchmark has
left existing evaluation studies simplistic, undiversified, and
homogeneous. To address these challenges, this paper proposes a comprehensive
benchmark, namely CausalBench, to evaluate the causality understanding
capabilities of LLMs. Originating from the causal research community,
CausalBench encompasses three causal learning-related tasks, which facilitate a
convenient comparison of LLMs' performance with classic causal learning
algorithms. In addition, causal networks of varying scales and densities are
integrated into CausalBench to explore the upper limits of LLMs' capabilities
across task scenarios of varying difficulty. Notably, background knowledge and
structured data are also incorporated into CausalBench to fully exercise the
potential of LLMs for long-text comprehension and the use of prior
information. Based on CausalBench, this paper evaluates nineteen leading LLMs
and reports findings on several fronts. First, we present the
strengths and weaknesses of LLMs and quantitatively explore the upper limits of
their capabilities across various scenarios. Second, we examine how well LLMs
adapt to specific network structures and to complex chain-of-thought
structures. Finally, this paper quantifies the differences across diverse
information sources and uncovers the gap between LLMs' capabilities in causal
understanding within textual contexts and within numerical domains.

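This study proposes CausalBench, a comprehensive benchmark for evaluating the ability of large language models to understand causality. By including three causal learning-related tasks and combining task scenarios of varying difficulty, the benchmark allows a convenient comparison of the performance of multiple large language models with classic causal learning algorithms. Using CausalBench, the study evaluates 19 leading large language models, reveals their strengths and weaknesses in various aspects, and quantitatively explores the upper limits of their capabilities across different scenarios. In addition, the study quantitatively presents the differences across information sources and reveals the gap between large language models' causal understanding in textual contexts and in numerical domains.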