As the capabilities of large language models (LLMs) continue to advance,
evaluating their performance becomes increasingly crucial and challenging. This
paper aims to address this challenge by introducing CMMLU, a comprehensive Chinese
benchmark that covers various subjects, including natural sciences, social
sciences, engineering, and the humanities. We conduct a thorough evaluation of 18
advanced multilingual- and Chinese-oriented LLMs, assessing their performance
across different subjects and settings. The results reveal that most existing
LLMs struggle to achieve an average accuracy of 50%, even when provided with
in-context examples and chain-of-thought prompts, whereas the random baseline
stands at 25%. This highlights significant room for improvement in LLMs.
Additionally, we conduct extensive experiments to identify factors impacting
the models' performance and propose directions for enhancing LLMs. CMMLU fills
the gap in evaluating the knowledge and reasoning capabilities of large
language models within the Chinese context.
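
To make the comparison between the 50% average accuracy and the 25% random baseline concrete, here is a minimal sketch of how both numbers can be computed. It assumes four-option multiple-choice questions (the abstract does not state the number of options; four is what a 25% uniform-guessing baseline implies). The function names and the toy answer lists are hypothetical and are not taken from the CMMLU codebase or data.

```python
def accuracy(predictions, answers):
    """Fraction of questions whose predicted option matches the gold option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_baseline(num_options=4):
    """Expected accuracy of uniform random guessing (assumes num_options choices)."""
    return 1.0 / num_options


# Hypothetical toy data, for illustration only (not real benchmark items).
gold = ["A", "C", "B", "D", "A", "B", "C", "D"]
pred = ["A", "C", "D", "D", "B", "B", "A", "C"]

acc = accuracy(pred, gold)
print(f"accuracy: {acc:.2%}, random baseline: {random_baseline():.2%}")
```

Under these assumptions, a model at 50% answers correctly twice as often as uniform guessing, yet still misses half of the questions.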

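This paper introduces CMMLU, a comprehensive Chinese benchmark covering natural sciences, social sciences, engineering, and the humanities. An evaluation of 18 advanced multilingual and Chinese-oriented LLMs across different subjects and settings shows that most existing LLMs struggle to reach an average accuracy of 50%, even when given in-context examples and chain-of-thought prompts, against a random baseline of 25%. This highlights significant room for improvement in LLMs.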

CMMLU: A Tool for Measuring Massive Multitask Language Understanding in Chinese

CMMLU: Measuring massive multitask language understanding in Chinese

In this paper, we study the problem of knowledge-intensive text-to-SQL, in
which domain knowledge is necessary to parse expert questions into SQL queries
over domain-specific tables. We formalize this scenario by building a new
Chinese benchmark KnowSQL consisting of domain-specific questions covering
various domains. We then address this problem by presenting formulaic
knowledge, rather than by annotating additional data examples. More concretely,
we construct a formulaic knowledge bank as a domain knowledge base and propose
a framework (ReGrouP) to leverage this formulaic knowledge during parsing.
Experiments using ReGrouP demonstrate a significant 28.2% improvement overall
on KnowSQL.

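Using a new Chinese benchmark dataset, KnowSQL, this paper proposes the ReGrouP framework, which draws on a formulaic knowledge bank as domain knowledge to address knowledge-intensive text-to-SQL, and achieves an overall improvement of 28.2% on KnowSQL.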

A Formulaic Knowledge Approach to Knowledge-Intensive Text-to-SQL Semantic Parsing

Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge

Artificial Intelligence (AI), along with the recent progress in biomedical
language understanding, is gradually changing medical practice. With the
development of biomedical language understanding benchmarks, AI applications
are widely used in the medical field. However, most benchmarks are limited to
English, which makes it challenging to replicate many of the successes in
English for other languages. To facilitate research in this direction, we
collect real-world biomedical data and present the first Chinese Biomedical
Language Understanding Evaluation (CBLUE) benchmark: a collection of natural
language understanding tasks including named entity recognition, information
extraction, clinical diagnosis normalization, and single-sentence/sentence-pair
classification, together with an associated online platform for model evaluation,
comparison, and analysis. To establish evaluation on these tasks, we report
empirical results with the current 11 pre-trained Chinese models, and
experimental results show that state-of-the-art neural models perform far
worse than the human ceiling. Our benchmark is released at
\url{this https URL&lang=en-us}.

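This paper introduces the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark, which covers a range of natural language processing tasks, including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification, together with an online platform for model evaluation, comparison, and analysis. Empirical results with 11 current pre-trained Chinese models show that state-of-the-art neural models perform far below human level.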