GPT-4 achieves high accuracy on medical QA tasks, leading with an accuracy of
86.70%, followed by Med-PaLM 2 at 86.50%. However, it still answers roughly
14% of questions incorrectly. Additionally, existing work uses GPT-4 only to
predict the correct option without any explanation and thus provides no
insight into the reasoning used by GPT-4 or other LLMs.
Therefore, we introduce a new domain-specific error taxonomy derived from
collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset
comprises 4153 correct and 919 incorrect GPT-4 responses to United States
Medical Licensing Examination (USMLE) questions. These
responses are quite long (258 words on average), containing detailed
explanations from GPT-4 justifying the selected option. We then launch a
large-scale annotation study using the Potato annotation platform and recruit
44 medical experts through Prolific, a well-known crowdsourcing platform. We
annotate 300 of the 919 incorrect data points at a granular level across
different error classes and create multi-label spans to identify the reasons
behind each error. In our annotated dataset, a substantial portion of GPT-4's
incorrect responses is categorized by annotators as "Reasonable response by
GPT-4".
This sheds light on the challenge of discerning explanations that may lead to
incorrect options, even among trained medical professionals. We also provide
medical concepts and medical semantic predications extracted using the SemRep
tool for every data point. We believe these resources will aid in evaluating
the ability of LLMs to answer complex medical questions. We make the resources
available at this https URL.

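GPT-4 demonstrates high accuracy on medical question-answering tasks, but roughly 14% of its answers are still wrong. We therefore introduce a new domain-specific error taxonomy developed in collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset contains 4153 correct and 919 incorrect GPT-4 responses to United States Medical Licensing Examination (USMLE) questions. The responses are long (258 words on average) and include GPT-4's detailed explanations of the selected option. Through a large-scale annotation study with medical experts, we annotated 300 of the incorrect data points with fine-grained classes to identify the reasons behind the errors. In our annotated dataset, a substantial portion of GPT-4's incorrect responses is labeled "Reasonable response by GPT-4", which highlights how difficult it is, even for trained medical professionals, to discern explanations that may lead to incorrect options. We also provide medical concepts and medical semantic predications extracted with the SemRep tool, which will help evaluate the ability of language models to answer complex medical questions. We make these resources available at the specified URL.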

Beyond Accuracy: Investigating GPT-4's Error Types on USMLE Questions

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Meeting summarization has become a critical task considering the increase in
online interactions. While new techniques are introduced regularly, their
evaluation uses metrics not designed to capture meeting-specific errors,
undermining effective evaluation. This paper investigates what the frequently
used automatic metrics capture and which errors they mask by correlating
automatic metric scores with human evaluations across a broad error taxonomy.
We begin with a comprehensive literature review on English meeting
summarization to define key challenges, such as speaker dynamics and
contextual turn-taking, and error types, such as missing information and
linguistic inaccuracy, concepts previously only loosely defined in the field.
We examine the relationship between characteristic challenges and errors
using annotated transcripts and summaries from Transformer-based
sequence-to-sequence and autoregressive models from the general summary QMSum
dataset. Through
experimental validation, we find that different model architectures respond
variably to challenges in meeting transcripts, resulting in differently
pronounced links between challenges and errors. Commonly used default metrics
struggle to capture observable errors, showing weak to moderate correlations,
while a third of the correlations show trends of error masking. Only a subset
reacts accurately to specific errors, while most correlations show either
unresponsiveness or failure to reflect the error's impact on summary quality.
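
To make the idea of correlating automatic metric scores with human error annotations concrete, the following is a minimal sketch rather than the paper's actual procedure. It computes a Spearman rank correlation in pure Python; the choice of Spearman, the function names, and the toy numbers are all assumptions made for illustration. Under this reading, a metric that reacts correctly to an error type should correlate negatively with human-rated error severity, while a correlation near zero suggests unresponsiveness and a positive one hints at error masking.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: one automatic metric score and one human-rated
# error severity (0 = no error, 3 = severe) per summary.
metric_scores = [0.41, 0.35, 0.52, 0.28, 0.47]
omission_severity = [1, 2, 0, 3, 1]

rho = spearman(metric_scores, omission_severity)
print(f"Spearman correlation with omission severity: {rho:.2f}")
```

In this toy data the metric drops as the human-rated severity rises, so the correlation is strongly negative; real annotations would typically be far noisier.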

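The key task of meeting summarization is to identify and extract keywords, but existing evaluation metrics cannot accurately capture meeting-specific errors. By studying the correlation between human and automatic evaluations, this paper shows that automatic metrics fail to capture observable errors and mask some of them. It also finds that different model architectures respond differently to the challenges in meeting transcripts, with pronounced links between challenges and errors.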

Exploring Automatic Evaluation Metrics for Meeting Summarization

What's under the hood: Investigating Automatic Metrics on Meeting Summarization

With a growing focus on morphological inflection systems for languages where
high-quality data is scarce, training data noise is a serious but so far
largely ignored concern. We aim to close this gap by investigating the types
of noise encountered within a pipeline for truly unsupervised morphological
paradigm completion and their impact on morphological inflection systems. First,
we propose an error taxonomy and annotation pipeline for inflection training
data. Then, we compare the effect of different types of noise on multiple
state-of-the-art inflection models. Finally, we propose a novel character-level
masked language modeling (CMLM) pretraining objective and explore its impact on
the models' resistance to noise. Our experiments show that various
architectures are impacted differently by separate types of noise, but
encoder-decoders tend to be more robust to noise than models trained with a
copy bias. CMLM pretraining helps transformers, but has a lower impact on LSTMs.
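
The abstract does not spell out how the CMLM objective builds its training examples, so the following is only a rough sketch of what character-level masking could look like. The mask symbol, the masking probability, and the function name `mask_characters` are hypothetical choices, not taken from the paper. Each character of a word is hidden with some probability, and the model would then be trained to recover the hidden characters from the rest of the word.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_characters(word, mask_prob=0.15, rng=None):
    """Build one character-level masked-LM example from a word.

    Returns (inputs, targets): inputs holds the characters with some replaced
    by MASK; targets holds the original character at masked positions and
    None elsewhere, so a loss would only be computed on masked positions.
    """
    rng = rng or random.Random()
    chars = list(word)
    if not chars:
        return [], []
    masked = {i for i in range(len(chars)) if rng.random() < mask_prob}
    if not masked:
        # Assumption: always mask at least one character per example.
        masked = {rng.randrange(len(chars))}
    inputs = [MASK if i in masked else c for i, c in enumerate(chars)]
    targets = [c if i in masked else None for i, c in enumerate(chars)]
    return inputs, targets


inputs, targets = mask_characters("walked", mask_prob=0.3, rng=random.Random(0))
print(inputs)
print(targets)
```

Forcing at least one masked position is a design choice made here so that short words still yield a training signal; an actual implementation may handle this differently.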

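This paper studies morphological inflection systems for languages where high-quality data is scarce. It proposes an error taxonomy and annotation pipeline for the types of noise encountered within a pipeline for unsupervised morphological paradigm completion, compares the effect of different types of noise on current state-of-the-art inflection models, and proposes a character-level masked language modeling (CMLM) pretraining objective, exploring its impact on the models' resistance to noise. Experiments show that different architectures are affected differently by different types of noise, but encoder-decoders are more robust than models with a copy bias. CMLM pretraining helps Transformers but has less impact on LSTMs.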