Large Language Models (LLMs) have shown the potential to significantly
contribute to patient care, diagnostics, and administrative processes. Emerging
biomedical LLMs address healthcare-specific challenges, including privacy
demands and computational constraints. However, evaluation of these models has
primarily been limited to non-clinical tasks, which do not reflect the
complexity of practical clinical applications. Additionally, there has been no
thorough comparison between biomedical and general-domain LLMs for clinical
tasks. To fill this gap, we present the Clinical Language Understanding
Evaluation (CLUE), a benchmark tailored to evaluate LLMs on real-world clinical
tasks. CLUE includes two novel datasets derived from MIMIC-IV discharge letters
and four existing tasks designed to test the practical applicability of LLMs in
healthcare settings. Our evaluation covers several biomedical and
general-domain LLMs, providing insights into their clinical performance and
applicability. CLUE represents a step towards a standardized approach to
evaluating and developing LLMs in healthcare to align future model development
with the real-world needs of clinical application. We publish our evaluation
and data generation scripts: this https URL

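To fill the gap left by the lack of evaluation on clinical tasks that are widely used in healthcare, we propose CLUE, a benchmark suited to real-world clinical tasks. By evaluating the clinical performance and applicability of several biomedical and general-domain LLMs, it advances a standardized approach to evaluating and developing LLMs in healthcare.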

CLUE: A Clinical Language Understanding Evaluation for LLMs

To interpret uncertainty estimates from differentiable probabilistic models,
recent work has proposed generating Counterfactual Latent Uncertainty
Explanations (CLUEs). However, for a single input, such approaches could output
a variety of explanations due to the lack of constraints placed on the
explanation. Here we augment the original CLUE approach to provide what we
call $\delta$-CLUE. CLUE indicates $\textit{one}$ way to change an input, while
remaining on the data manifold, such that the model becomes more confident
about its prediction. We instead return a $\textit{set}$ of plausible CLUEs:
multiple, diverse inputs that are within a $\delta$ ball of the original input
in latent space, all yielding confident predictions.
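
The idea of returning a set of explanations inside a $\delta$ ball can be illustrated with a minimal sketch. This is not the authors' implementation: the linear "decoder", the logistic "classifier", and every name below (`W_DEC`, `W_CLF`, `delta_clue_set`, and so on) are hypothetical stand-ins chosen for illustration. The sketch minimizes only predictive entropy and omits any distance penalty in input space. It runs projected gradient descent from several random starting points within the ball, so that different runs can settle on different low-uncertainty latent points.

```python
import numpy as np

# Hypothetical toy stand-ins (assumptions, not from the paper): a linear
# "decoder" from a 2-D latent space to a 3-D input space, and a logistic
# "classifier" acting on the decoded input.
W_DEC = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
W_CLF = np.array([1.5, -1.0, 0.5])


def predict_proba(z):
    """Probability of class 1 for the input decoded from latent point z."""
    x = W_DEC @ z
    return 1.0 / (1.0 + np.exp(-(W_CLF @ x)))


def entropy(z):
    """Binary predictive entropy in nats, used here as the uncertainty."""
    p = np.clip(predict_proba(z), 1e-12, 1 - 1e-12)
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))


def project_to_ball(z, z0, delta):
    """Project z onto the L2 ball of radius delta centred at z0."""
    diff = z - z0
    norm = np.linalg.norm(diff)
    return z if norm <= delta else z0 + diff * (delta / norm)


def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad


def delta_clue_set(z0, delta, n_starts=8, n_steps=200, lr=0.5, seed=0):
    """Return latent points within delta of z0 found by projected descent."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_starts):
        # Random start inside the ball so that different runs can diverge.
        direction = rng.normal(size=z0.shape)
        direction /= np.linalg.norm(direction)
        z = z0 + direction * delta * rng.uniform()
        for _ in range(n_steps):
            # Descend on the entropy, then enforce the delta-ball constraint.
            z = project_to_ball(z - lr * numerical_grad(entropy, z), z0, delta)
        results.append(z)
    return np.array(results)
```

Calling `delta_clue_set(np.zeros(2), delta=2.0)` returns one latent point per random start. In this toy setup the origin is maximally uncertain, so each returned point has lower entropy than the origin while staying within the ball. Decoding each point with `W_DEC @ z` gives the corresponding candidate input.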

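By extending the CLUE method, we propose δ-CLUE, which provides multiple plausible explanations that each make the model more confident in its prediction, giving a better account of uncertainty estimates from probabilistic models.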

δ-CLUE: Diverse Sets of Explanations for Uncertainty Estimates

The advent of natural language understanding (NLU) benchmarks for English,
such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a
diverse set of tasks. These comprehensive benchmarks have facilitated a broad
range of research and applications in natural language processing (NLP). The
problem, however, is that most such benchmarks are limited to English, which
has made it difficult to replicate many of the successes in English NLU for
other languages. To help remedy this issue, we introduce the first large-scale
Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an
open-ended, community-driven project that brings together 9 tasks spanning
several well-established single-sentence/sentence-pair classification tasks, as
well as machine reading comprehension, all on original Chinese text. To
establish results on these tasks, we report scores using an exhaustive set of
current state-of-the-art pre-trained Chinese models (9 in total). We also
introduce a number of supplementary datasets and additional tools to help
facilitate further progress on Chinese NLU. Our benchmark is released at
this https URL

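This paper introduces CLUE, the first large-scale Chinese language understanding evaluation benchmark, to help address the difficulty of carrying English-specific natural language understanding work over to other languages. It reports results for 9 state-of-the-art pre-trained Chinese models and introduces a set of supplementary datasets and tools to foster further progress in Chinese natural language understanding.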