Large Language Models (LLMs) have demonstrated impressive capabilities across
a wide range of tasks. However, their proficiency and reliability in the
specialized domain of data analysis, particularly in data-driven thinking,
remain uncertain. To bridge this gap, we introduce BIBench, a
comprehensive benchmark designed to evaluate the data analysis capabilities of
LLMs within the context of Business Intelligence (BI). BIBench assesses LLMs
across three dimensions: 1) BI foundational knowledge, evaluating the models'
numerical reasoning and familiarity with financial concepts; 2) BI knowledge
application, determining the models' ability to quickly comprehend textual
information and generate analysis questions from multiple views; and 3) BI
technical skills, examining the models' use of technical knowledge to address
real-world data analysis challenges. BIBench comprises 11 sub-tasks spanning
three task types: classification, extraction, and generation.
Additionally, we have developed BIChat, a domain-specific dataset with over a
million data points, for fine-tuning LLMs. We will release BIBench, BIChat,
and the evaluation scripts at https://github.com/cubenlp/BIBench. This
benchmark aims to provide a measure for in-depth analysis of LLM abilities and
foster the advancement of LLMs in the field of data analysis.

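To evaluate the data analysis capabilities of large language models (LLMs) in
the business intelligence domain, the study introduces BIBench, a comprehensive
benchmark. BIBench evaluates LLMs along three dimensions (BI foundational
knowledge, knowledge application, and technical skills) and comprises 11
sub-tasks. The study also develops BIChat, a domain-specific dataset of about a
million data points, for fine-tuning LLMs. By providing a measure for in-depth
analysis of LLM capabilities, BIBench aims to advance LLMs in the field of data
analysis.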