The rapid rise of Language Models (LMs) has expanded their use in several
applications. Yet, due to constraints of model size, associated cost, or
proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always
feasible. With open, smaller LMs emerging, more applications can leverage their
capabilities, but selecting the right LM can be challenging. This work conducts
an in-depth experimental analysis of the semantic correctness of outputs of 10
smaller, open LMs across three aspects: task types, application domains and
reasoning types, using diverse prompt styles. We demonstrate that the most
effective models and prompt styles vary depending on the specific requirements.
Our analysis provides a comparative assessment of LMs and prompt styles using a
proposed three-tier schema of aspects for their strategic selection based on
use-case and other constraints. We also show that if utilized appropriately,
these LMs can compete with, and sometimes outperform, SOTA LLMs like
DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.

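Using ten smaller, open language models, this work conducts an in-depth experimental analysis across three aspects (task types, application domains, and reasoning types), provides a comparative assessment of language models and prompt styles, and shows how effective these models are under specific requirements and how well they compete with SOTA language models.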

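Evaluating the Performance of Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis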

Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis

LLM-as-a-judge approaches are a practical and effective way of assessing a
range of text tasks, aligning with human judgements especially when applied in
a comparative assessment fashion. However, when using pairwise comparisons to
rank a set of candidates, the computational cost scales quadratically with the
number of candidates, which can be a practical limitation. This paper
introduces a Product of Experts (PoE) framework for efficient LLM Comparative
Assessment. Here individual comparisons are considered experts that provide
information on a pair's score difference. The PoE framework combines the
information from these experts to yield an expression that can be maximized
with respect to the underlying set of candidates, and is highly flexible in
that any form of expert can be assumed. When Gaussian experts are used, one can
derive simple closed-form solutions for the optimal candidate ranking, as well
as expressions for selecting which comparisons should be made to maximize the
probability of this ranking. Our approach enables efficient comparative
assessment, where by using only a small subset of the possible comparisons, one
can generate score predictions that correlate as well with human judgements as
the predictions obtained when all comparisons are used. We evaluate the approach on
multiple NLG tasks and demonstrate that our framework can yield considerable
computational savings when performing pairwise comparative assessment. When the
number of candidates N is large, with as few as 2% of comparisons the PoE
solution can achieve similar performance to when all comparisons are used.
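
The Gaussian case can be illustrated with a minimal sketch. It assumes
equal-variance, unbiased Gaussian experts, in which case maximizing the product
of experts reduces to an ordinary least-squares fit of the scores to the
observed score differences. The function names, the input format (i, j, diff),
and the choice to pin one score and then center the result are assumptions made
for illustration, not the paper's implementation.

```python
def poe_gaussian_scores(n_candidates, comparisons):
    """Least-squares scores from (i, j, diff) triples.

    diff is a judge's estimate of score[i] - score[j] (assumed input format).
    Under equal-variance Gaussian experts the product of experts is maximized
    by solving the normal equations L s = b, where L is the Laplacian of the
    comparison graph.
    """
    n = n_candidates
    L = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, j, diff in comparisons:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        b[i] += diff
        b[j] -= diff
    # Scores are only identifiable up to a constant, so pin candidate 0 to
    # zero and solve the reduced system by Gauss-Jordan elimination.
    m = n - 1
    A = [L[r + 1][1:] + [b[r + 1]] for r in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("comparisons must connect all candidates")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    s = [0.0] + [A[r][m] / A[r][r] for r in range(m)]
    mean = sum(s) / n
    return [v - mean for v in s]  # centered scores


def num_pairs(n_candidates):
    """Number of unordered pairs, which grows quadratically with N."""
    return n_candidates * (n_candidates - 1) // 2


# Two of the three possible comparisons are enough to rank three candidates.
subset = [(1, 0, 1.0), (2, 1, 2.0)]
scores = poe_gaussian_scores(3, subset)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

In this sketch the subset must link every candidate to every other through some
chain of comparisons, otherwise the scores are not determined and the function
raises an error. For 100 candidates, num_pairs gives 4,950 unordered pairs,
which is the cost that a small subset of comparisons avoids.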

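LLM-as-a-judge approaches are a practical and effective way of assessing text, and they align with human judgements especially when applied in a comparative fashion. This paper introduces a Product of Experts (PoE) framework for efficient LLM comparative assessment. By combining the information from the individual experts, it yields an expression that can be maximized with respect to the underlying set of candidates; the framework is highly flexible and can accommodate different forms of expert. When Gaussian experts are used, one can derive simple closed-form solutions for the optimal candidate ranking, as well as expressions for selecting which comparisons should be made to maximize the probability of this ranking. The approach enables efficient comparative assessment: using only a small subset of the possible comparisons, it generates score predictions that correlate with human judgements about as well as those obtained from all comparisons. The approach is evaluated on multiple natural language generation tasks and is shown to yield considerable computational savings for pairwise comparative assessment. When N is large, the PoE solution can reach performance similar to using all comparisons with as few as 2% of them.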

Efficient LLM Comparative Assessment: Pairwise Comparisons Based on an Expert Framework

Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons

Evaluating Natural Language Generation (NLG) outputs is crucial but laborious
and expensive. While various automatic NLG assessment methods have been
proposed, they are often quite task-specific and have to be engineered with a
particular domain and attribute in mind. In this work, we propose a robust
zero-shot approach to NLG evaluation using pairwise comparative judgment with
open-source Large Language Models (LLMs). The motivation for this approach is
that even as humans, it is easier to determine which of two options is better
than it is to score each option independently and objectively. We use this insight
and leverage the emergent abilities of LLMs, where we probe FlanT5 to determine
which of two candidate responses is better, rather than assigning absolute
scores. Our results demonstrate that comparative assessment is a more effective
approach than absolute scoring, enabling smaller open-source LLMs to achieve
comparable performance to larger public access APIs. We evaluate systems on
both summary evaluation and dialogue response generation, and show that
open-source LLMs can lead to good correlations with human scores for a range of
different attributes.
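
One simple way to turn pairwise judgments into per-candidate scores is a win
ratio. The sketch below is illustrative only: the abstract does not say how
comparisons are aggregated, so the win-ratio rule, the function name
win_ratios, and the prefers_first callable (a stand-in for prompting an LLM
such as FlanT5 with two candidate responses) are all assumptions. Both
orderings of each pair are queried, which is one possible way to reduce
sensitivity to the order in which candidates are presented.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def win_ratios(
    candidates: Sequence[str],
    prefers_first: Callable[[str, str], bool],
) -> List[float]:
    """Fraction of pairwise comparisons each candidate wins.

    prefers_first(a, b) is a hypothetical judge that returns True when it
    considers response a better than response b.
    """
    n = len(candidates)
    if n < 2:
        raise ValueError("need at least two candidates to compare")
    wins = [0] * n
    # Every ordered pair (i, j) with i != j is judged once.
    for i, j in permutations(range(n), 2):
        if prefers_first(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Each candidate takes part in 2 * (n - 1) ordered comparisons.
    total = 2 * (n - 1)
    return [w / total for w in wins]


# Toy judge for demonstration only: prefers the longer response.
toy_judge = lambda a, b: len(a) > len(b)
ratios = win_ratios(["a", "bbb", "cc"], toy_judge)
```

In practice the toy judge would be replaced by a call to the LLM, and the
resulting win ratios would be correlated with human scores for the attribute
being assessed.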

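This work uses comparative assessment of natural language generation outputs to examine how well large language models perform as evaluators, making evaluation possible without relying on a specific domain or attribute.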