In light of recent breakthroughs in large language models (LLMs) that have
revolutionized natural language processing (NLP), there is an urgent need for
new benchmarks to keep pace with the fast development of LLMs. In this paper,
we propose CFLUE, the Chinese Financial Language Understanding Evaluation
benchmark, designed to assess the capability of LLMs across various dimensions.
Specifically, CFLUE provides datasets tailored for both knowledge assessment
and application assessment. In knowledge assessment, it consists of 38K+
multiple-choice questions with associated solution explanations. These
questions serve dual purposes: answer prediction and question reasoning. In
application assessment, CFLUE features 16K+ test instances across distinct
groups of NLP tasks such as text classification, machine translation, relation
extraction, reading comprehension, and text generation. Using CFLUE, we conduct
a thorough evaluation of representative LLMs. The results reveal that only
GPT-4 and GPT-4-turbo achieve an accuracy exceeding 60% in answer prediction
for knowledge assessment, suggesting that there is still substantial room for
improvement in current LLMs. In application assessment, GPT-4 and GPT-4-turbo
remain the top two performers, but their advantage over lightweight LLMs is
noticeably smaller. The datasets and scripts associated with CFLUE are openly
accessible at this https URL.
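
To make the answer-prediction metric concrete, the following is a minimal sketch of how accuracy over multiple-choice questions could be computed. The function name and the representation of answers as option strings (for example "A" or "AB") are assumptions for illustration; they are not taken from the CFLUE scripts.

```python
from typing import Sequence


def answer_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted option matches the gold option.

    Hypothetical helper: answers are assumed to be option strings such as "A" or "AB".
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Normalize case and surrounding whitespace so "a " and "A" count as the same option.
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)
```

Under this sketch, a model "exceeds 60%" when `answer_accuracy(...)` returns a value above 0.6 on the evaluation set.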

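We propose CFLUE, a Chinese Financial Language Understanding Evaluation benchmark, to assess the capabilities of large language models in knowledge assessment and application assessment. CFLUE provides tailored datasets for both, and we use it to conduct a thorough evaluation of representative large language models.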

Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset

The rapid development of Large Language Models (LLMs) has led to a surge in
applications that facilitate collaboration among multiple agents, assisting
humans in their daily tasks. However, a significant gap remains in assessing to
what extent LLM-powered applications genuinely enhance user experience and task
execution efficiency. This highlights the need to verify the utility of LLM-powered
applications, particularly by ensuring alignment between the application's
functionality and end-user needs. We introduce AgentEval, a novel framework
designed to simplify the utility verification process by automatically
proposing a set of criteria tailored to the unique purpose of any given
application. This allows for a comprehensive assessment, quantifying the
utility of an application against the suggested criteria. We present a
comprehensive analysis of the effectiveness and robustness of AgentEval on two
open-source datasets: Math Problem Solving and ALFWorld household-related
tasks. For reproducibility, we make the data, code, and all logs publicly
available at this https URL.
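
The idea of quantifying utility against a set of criteria can be sketched as follows. This is an illustrative outline only: the `Criterion` class, the `quantify` and `to_score` functions, and the ordered `accepted_values` scale are all hypothetical names, not AgentEval's actual interface, and the `rate` callable stands in for whatever component (such as a model call) assigns a value to each criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One evaluation criterion; accepted_values is ordered from worst to best."""
    name: str
    accepted_values: Tuple[str, ...]


def quantify(criteria: List[Criterion], rate: Callable[[Criterion], str]) -> Dict[str, str]:
    """Assign each criterion one of its accepted values using the supplied rater."""
    results: Dict[str, str] = {}
    for criterion in criteria:
        value = rate(criterion)
        # Reject ratings outside the declared scale instead of silently keeping them.
        if value not in criterion.accepted_values:
            raise ValueError(f"{value!r} is not an accepted value for {criterion.name!r}")
        results[criterion.name] = value
    return results


def to_score(criterion: Criterion, value: str) -> float:
    """Map an accepted value to [0, 1] by its position on the ordered scale."""
    steps = len(criterion.accepted_values) - 1
    return criterion.accepted_values.index(value) / steps if steps else 1.0
```

In this sketch the criteria are supplied by the caller; in the framework described above they would be proposed automatically for the application under evaluation.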

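By proposing a set of criteria tailored to an application's specific purpose, the AgentEval framework automates and simplifies utility verification, enabling a comprehensive assessment that quantifies the application's utility.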