Current LLM evaluation predominantly uses prompts that each contain a single
problem. We propose multi-problem evaluation as an additional approach for
studying how well LLMs handle multiple problems within one prompt. We present
a systematic study in this regard by comprehensively examining
7 LLMs on 4 related types of tasks constructed from 6 classification
benchmarks. The 4 task types are traditional single-problem tasks,
homogeneous multi-problem tasks, and two index selection tasks that embed the
multi-problem tasks. We find that LLMs are competent multi-problem solvers:
they generally perform (nearly) as well on multi-problem tasks as on
single-problem tasks. Furthermore, contrary to common expectation, they often
do not suffer from positional bias with long inputs. This makes multi-problem
prompting a simple and cost-efficient prompting method of practical
significance. However, our results also strongly indicate that LLMs lack true
understanding: they perform significantly worse on the two index selection
tasks than on the multi-problem task under various evaluation settings,
although they can do index selection in general.
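
To make the contrast between the task types concrete, here is a minimal sketch of how single-problem, multi-problem, and index selection prompts might be assembled from one classification benchmark. The abstract does not give the prompt wording, so every template string, function name, and the '<index>: <label>' answer format below is a hypothetical illustration, not the authors' setup.

```python
from typing import List, Optional


def build_single_prompts(instruction: str, problems: List[str]) -> List[str]:
    """Traditional setup: one prompt per problem (hypothetical template)."""
    return [f"{instruction}\n\nText: {p}\nLabel:" for p in problems]


def build_multi_prompt(instruction: str, problems: List[str]) -> str:
    """Multi-problem setup: all problems in one prompt, numbered from 1."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(problems, start=1))
    return (
        f"{instruction}\n"
        f"Answer for each of the {len(problems)} numbered texts below, "
        "one per line, in the form '<index>: <label>'.\n\n"
        f"{numbered}"
    )


def build_index_selection_prompt(
    instruction: str, problems: List[str], target_label: str
) -> str:
    """One guess at an index selection variant: ask only for matching indices."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(problems, start=1))
    return (
        f"{instruction}\n"
        f"List the indices of all texts whose label is '{target_label}'.\n\n"
        f"{numbered}"
    )


def parse_multi_answer(answer: str, n: int) -> List[Optional[str]]:
    """Read '<index>: <label>' lines; problems with no answer stay None."""
    labels: List[Optional[str]] = [None] * n
    for line in answer.splitlines():
        index_part, sep, label_part = line.partition(":")
        index_part = index_part.strip()
        if not sep or not index_part.isdecimal():
            continue  # skip lines that do not follow the assumed format
        i = int(index_part)
        if 1 <= i <= n:
            labels[i - 1] = label_part.strip()
    return labels
```

Under this sketch, a batch of n problems costs one model call instead of n, with the shared instruction sent once, which is where the cost saving of multi-problem prompting would come from.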

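Current LLM evaluation mainly relies on prompts that contain a single problem. We propose multi-problem evaluation as an additional way to study how LLMs handle multiple problems. We conduct a systematic study that examines 7 LLMs on 4 related task types built from 6 classification benchmarks. We find that LLMs are capable multi-problem solvers: their performance on multi-problem tasks is usually close to, or as good as, their performance on single-problem tasks. Moreover, contrary to common expectation, they usually show no positional bias with long inputs. This makes multi-problem prompting a simple, cost-efficient, and practical method. However, our results also strongly suggest that LLMs lack true understanding: they perform significantly worse on the two index selection tasks than on the multi-problem tasks, even though they can perform index selection in general.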

Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities

Crowdsourcing is a popular method used to estimate ground-truth labels by
collecting noisy labels from workers. In this work, we are motivated by
crowdsourcing applications where each worker can exhibit two levels of accuracy
depending on a task's type. Applying algorithms designed for the traditional
Dawid-Skene model to such a scenario yields performance that is limited by
the hard tasks. Therefore, we first extend the model to allow worker accuracy
to vary depending on a task's unknown type. Then we propose a spectral method
to partition tasks by type. After separating tasks by type, any Dawid-Skene
algorithm (i.e., any algorithm designed for the Dawid-Skene model) can be
applied independently to each type to infer the truth values. We theoretically
prove that when crowdsourced data contain tasks with varying levels of
difficulty, our algorithm infers the true labels with higher accuracy than any
Dawid-Skene algorithm. Experiments show that our method is effective in
practical applications.
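
The final step of this pipeline, applying an aggregator independently to each task type, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the partition `task_type` is taken as given (the spectral step is not reproduced here), labels are binary (+1 or -1), and `aggregate` is a simple stand-in (a majority vote followed by one reweighting round) for whichever Dawid-Skene algorithm one would actually plug in. All names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Label = Tuple[int, int, int]  # (worker_id, task_id, vote), vote is +1 or -1


def aggregate(labels: List[Label]) -> Dict[int, int]:
    """Stand-in aggregator: majority vote, then one accuracy-weighted revote."""
    votes = defaultdict(list)
    for worker, task, vote in labels:
        votes[task].append((worker, vote))

    # Round 1: plain majority vote per task (ties resolve to +1).
    first = {t: 1 if sum(v for _, v in ws) >= 0 else -1 for t, ws in votes.items()}

    # Estimate each worker's accuracy as agreement with the round-1 answers.
    agree: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for worker, task, vote in labels:
        total[worker] += 1
        agree[worker] += int(vote == first[task])

    # Log-odds weights with add-one smoothing, so no division by zero occurs.
    weight = {
        w: math.log((agree[w] + 1) / (total[w] - agree[w] + 1)) for w in total
    }

    # Round 2: weighted vote per task.
    return {
        t: 1 if sum(weight[w] * v for w, v in ws) >= 0 else -1
        for t, ws in votes.items()
    }


def aggregate_by_type(labels: List[Label], task_type: Dict[int, int]) -> Dict[int, int]:
    """Run the aggregator separately on each task type and merge the results."""
    by_type = defaultdict(list)
    for worker, task, vote in labels:
        by_type[task_type[task]].append((worker, task, vote))
    result: Dict[int, int] = {}
    for subset in by_type.values():
        result.update(aggregate(subset))
    return result
```

Because worker weights are estimated within each type, a worker who is reliable on one type and unreliable on the other gets a different weight in each group, which a single pooled estimate could not capture.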

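This paper proposes a label clustering algorithm based on a spectral method, which improves the accuracy with which Dawid-Skene model inference recovers the true label of each task in crowdsourcing.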