Multiple-choice (MC) tests are an efficient method to assess English
learners. It is useful for test creators to rank candidate MC questions by
difficulty during exam curation. Typically, the difficulty is determined by
having human test takers trial the questions in a pretesting stage, but this is
expensive and does not scale. We therefore explore automated approaches to
ranking MC questions by difficulty. Because little data is available for
explicitly training a system to predict difficulty scores, we compare task
transfer and zero-shot approaches: task transfer adapts level classification
and reading comprehension systems for difficulty ranking, while zero-shot
prompting of instruction-finetuned language models contrasts absolute
assessment against comparative assessment. Level classification transfers
better than reading comprehension. Additionally, zero-shot comparative
assessment ranks question difficulty more effectively than absolute assessment
and even the task transfer approaches, achieving a Spearman's correlation of
40.4%. Combining the systems further boosts the correlation.
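
As a rough illustration of comparative assessment, the sketch below ranks
questions by how often a pairwise judge calls them the harder one, then scores
the ranking with Spearman's correlation. The `harder` callable is a
hypothetical stand-in for a prompted language model, and averaging over all
ordered pairs is an assumption, not necessarily the aggregation used in the
paper.

```python
from itertools import permutations
from typing import Callable, Sequence


def rank_by_pairwise_wins(items: Sequence[str],
                          harder: Callable[[str, str], float]) -> list[float]:
    """Score each item by its average probability of being judged harder.

    `harder(a, b)` is a hypothetical judge returning P(a is harder than b).
    Requires at least two items.
    """
    n = len(items)
    scores = [0.0] * n
    for i, j in permutations(range(n), 2):
        p = harder(items[i], items[j])
        scores[i] += p
        scores[j] += 1.0 - p
    # Every item takes part in 2 * (n - 1) ordered comparisons.
    return [s / (2 * (n - 1)) for s in scores]


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def toy_judge(a: str, b: str) -> float:
    """Toy judge for illustration only: longer text counts as harder."""
    return 1.0 if len(a) > len(b) else 0.0


questions = ["a", "bbb", "cc"]
predicted = rank_by_pairwise_wins(questions, toy_judge)
reference = [1.0, 3.0, 2.0]  # made-up difficulty scores, for illustration
print(predicted, spearman(predicted, reference))
```

In practice the judge would wrap a model call, and the reference scores would
come from pretesting data rather than the toy values used here.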

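Multiple-choice (MC) tests are an efficient way to assess English learners. This paper explores automated approaches to ranking MC questions by difficulty, comparing task transfer and zero-shot approaches. Level classification transfers better than reading comprehension, and zero-shot comparative assessment ranks question difficulty more effectively than absolute assessment and the task transfer approaches, with a Spearman's correlation of 40.4%. Combining the systems further improves the correlation.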

Question Difficulty Ranking for Multiple-Choice Reading Comprehension

Multiple-choice tests are a common approach for assessing candidates'
comprehension skills. Standard multiple-choice reading comprehension exams
require candidates to select the correct answer option from a discrete set
based on a question in relation to a contextual passage. For appropriate
assessment, the distractor answer options must by definition be incorrect but
plausible and diverse. However, generating good quality distractors satisfying
these criteria is a challenging task for content creators. We propose automated
assessment metrics for the quality of distractors in multiple-choice reading
comprehension tests. Specifically, we define quality in terms of the
incorrectness, plausibility and diversity of the distractor options. We assess
incorrectness using the classification ability of a binary multiple-choice
reading comprehension system. Plausibility is assessed by considering the
distractor confidence, that is, the probability mass that a standard
multi-class multiple-choice reading comprehension system assigns to the
distractor options. Diversity is assessed by pairwise comparison of an
embedding-based equivalence metric between the distractors of a question. To
further validate the plausibility metric, we compare it against candidate
distributions over multiple-choice questions and measure its agreement with a
ChatGPT model's interpretation of distractor plausibility and diversity.
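
A minimal sketch of the plausibility and diversity metrics follows. It assumes
the reading comprehension system returns one probability per answer option,
and it substitutes cosine similarity for the embedding-based equivalence
metric, which the abstract does not specify. All function names are
hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def distractor_confidence(probs: Sequence[float], correct_index: int) -> float:
    """Probability mass placed on the distractors (all options but the key)."""
    return sum(p for k, p in enumerate(probs) if k != correct_index)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity(embeddings: Sequence[Sequence[float]]) -> float:
    """One minus the mean pairwise similarity between distractor embeddings.

    Cosine similarity is an assumed stand-in for the equivalence metric.
    Requires at least two distractors.
    """
    pairs = list(combinations(embeddings, 2))
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Made-up values for illustration: option 0 is the correct answer.
option_probs = [0.7, 0.1, 0.1, 0.1]
print(distractor_confidence(option_probs, correct_index=0))
print(diversity([[1.0, 0.0], [0.0, 1.0]]))
```

Under this sketch, a higher distractor confidence suggests more plausible
distractors, and a diversity near zero flags distractors that are close to
paraphrases of one another.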

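Automated assessment of distractor quality in multiple-choice reading comprehension tests, with metrics for incorrectness, plausibility and diversity.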

Assessing Distractors in Multiple-Choice Tests

Multiple-choice reading and listening comprehension tests are an important
part of language assessment. Content creators for standard educational tests
need to carefully curate questions that assess the comprehension abilities of
candidates taking the tests. However, recent work has shown that a large number
of questions in general multiple-choice reading comprehension datasets can be
answered without comprehension, by leveraging world knowledge instead. This
work investigates how much of a contextual passage needs to be read to work out
the correct answer in multiple-choice reading comprehension tests based on
conversation transcriptions and in listening comprehension tests. We find that
automated reading comprehension systems can perform significantly better than
random with partial or even no access to the context passage. These findings
offer an approach for content creators to automatically capture the trade-off
between comprehension and world knowledge required for their proposed
questions.
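
The partial-context experiment can be sketched as follows: truncate the
passage to a fraction of its words, then measure accuracy at each fraction.
Keeping the leading words is an assumption about how the context is shortened,
and `predict` is a hypothetical stand-in for a reading comprehension system.

```python
from typing import Callable, Sequence


def truncate_context(passage: str, fraction: float) -> str:
    """Keep the first `fraction` of the passage, measured in words."""
    words = passage.split()
    keep = round(len(words) * fraction)
    return " ".join(words[:keep])


def accuracy_at_fraction(examples: Sequence[dict],
                         predict: Callable[[str, str, Sequence[str]], int],
                         fraction: float) -> float:
    """Accuracy when the system sees only part of each passage."""
    correct = 0
    for ex in examples:
        context = truncate_context(ex["passage"], fraction)
        choice = predict(context, ex["question"], ex["options"])
        correct += int(choice == ex["answer"])
    return correct / len(examples)


def toy_predict(context: str, question: str, options: Sequence[str]) -> int:
    """Toy system for illustration: pick the first option found in the context."""
    for index, option in enumerate(options):
        if option in context:
            return index
    return 0  # fall back to the first option when nothing matches


# Made-up example for illustration only.
examples = [{
    "passage": "the cat sat on the mat",
    "question": "Where did the cat sit?",
    "options": ["dog", "mat"],
    "answer": 1,
}]

for fraction in (0.0, 0.5, 1.0):
    print(fraction, accuracy_at_fraction(examples, toy_predict, fraction))
```

Comparing the accuracy at fraction 0.0 with the random-guess baseline (one over
the number of options) gives a rough sense of how much a question leans on
world knowledge rather than the passage.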

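This paper investigates how much of the context must be read to answer correctly in multiple-choice reading and listening comprehension tests. It finds that automated reading comprehension systems perform better than random guessing even with partial or no access to the context, and it offers content creators a way to automatically capture the trade-off between the comprehension and the world knowledge their questions require.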

Analyzing Multiple-Choice Reading and Listening Comprehension Tests

The present study aims to explore the capabilities of Language Models (LMs)
in tackling high-stakes multiple-choice tests, represented here by the Exame
Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination
widely adopted by Brazilian universities. This exam poses challenging tasks for
LMs, since its questions may span multiple fields of knowledge, requiring
understanding of information from diverse domains. For instance, a question may
require comprehension of both statistics and biology to be solved. This work
analyzed responses generated by GPT-3.5 and GPT-4 models for questions
presented in the 2009-2017 exams, as well as for questions of the 2022 exam,
which were made public after the training of the models was completed.
Furthermore, different prompt strategies were tested, including the use of
Chain-of-Thought (CoT) prompts to generate explanations for answers. On the
2022 edition, the best-performing model, GPT-4 with CoT, achieved an accuracy
of 87%, surpassing GPT-3.5 by 11 points. The code and data used in the
experiments are available at this https URL.

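This study explores the capabilities of language models on high-stakes multiple-choice questions that span multiple fields of knowledge, analyzing the performance of GPT-3.5 and GPT-4 on the Exame Nacional do Ensino Médio (ENEM) and testing different prompting strategies. On the 2022 edition, GPT-4 with CoT performed best, reaching an accuracy of 87%.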