Large Language Models (LLMs) have unlocked new capabilities and applications;
however, evaluating their alignment with human preferences still poses
significant challenges. To address this issue, we introduce Chatbot Arena, an
open platform for evaluating LLMs based on human preferences. Our methodology
employs a pairwise comparison approach and leverages input from a diverse user
base through crowdsourcing. The platform has been operational for several
months, amassing over 240K votes. This paper describes the platform, analyzes
the data we have collected so far, and explains the tried-and-true statistical
methods we are using for efficient and accurate evaluation and ranking of
models. We confirm that the crowdsourced questions are sufficiently diverse and
discriminating and that the crowdsourced human votes are in good agreement with
those of expert raters. These analyses collectively establish a robust
foundation for the credibility of Chatbot Arena. Because of its unique value
and openness, Chatbot Arena has emerged as one of the most referenced LLM
leaderboards, widely cited by leading LLM developers and companies. Our demo is
publicly available at https://chat.lmsys.org.
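To make the ranking step concrete, here is a minimal sketch of how pairwise votes could be turned into a model ranking. It assumes a Bradley-Terry model fitted with a simple iterative update; the abstract above does not name the statistical method used, so this choice is an assumption for illustration. The function name `bradley_terry`, the model names, and the vote data are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(votes, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative sketch only: ties are ignored and no confidence
    intervals are computed. Returns a dict mapping each model to a
    strength, normalized so the strengths sum to 1.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # games played per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            # Sum over opponents j of n_ij / (p_i + p_j).
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical votes: each tuple is (winner, loser) for one comparison.
example_votes = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")] * 1
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")] * 1
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")] * 1
)
example_strengths = bradley_terry(example_votes)
example_ranking = sorted(example_strengths, key=example_strengths.get, reverse=True)
```

In this toy data every pair is compared four times, so the ranking simply follows the win totals. With uneven numbers of comparisons per pair, as in crowdsourced voting, the fitted strengths account for opponent strength rather than raw win counts.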

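Chatbot Arena is an open platform for evaluating large language models based on human preferences. It collects data through crowdsourced pairwise comparisons and uses established statistical methods for evaluation and ranking to ensure reliability and credibility. It has become one of the most valued and most referenced LLM leaderboards.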

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Recent advances in large language models (LLMs) have led to the development
of various evaluation benchmarks. These benchmarks typically rely on a single
instruction template for evaluating all LLMs on a specific task. In this paper,
we comprehensively analyze the brittleness of results obtained via
single-prompt evaluations across 6.5M instances, involving 20 different LLMs
and 39 tasks from 3 benchmarks. To improve the robustness of the analysis, we
propose to evaluate LLMs with a set of diverse prompts instead. We discuss
tailored evaluation metrics for specific use cases (e.g., LLM developers vs.
developers interested in a specific downstream task), ensuring a more reliable
and meaningful assessment of LLM capabilities. We then implement these criteria
and conduct evaluations of multiple models, providing insights into the true
strengths and limitations of current LLMs.
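The brittleness described above can be illustrated with a short sketch: when each model is scored under several prompt paraphrases, the ranking from any single prompt can disagree with the ranking from an aggregate over all prompts. The aggregates used here (mean, max, and min over prompts) are assumptions chosen for illustration, not the metrics defined in the paper, and the model names, prompt identifiers, and scores are hypothetical.

```python
from statistics import mean


def aggregate_over_prompts(scores):
    """Summarize per-prompt scores for each model.

    scores: dict mapping model name -> dict mapping prompt id -> score.
    Returns dict mapping model name -> {"mean", "max", "min"}.
    """
    return {
        model: {
            "mean": mean(per_prompt.values()),
            "max": max(per_prompt.values()),
            "min": min(per_prompt.values()),
        }
        for model, per_prompt in scores.items()
    }


def rank_by_prompt(scores, prompt):
    """Rank models, best first, using the score from one prompt only."""
    return sorted(scores, key=lambda m: scores[m][prompt], reverse=True)


# Hypothetical accuracies for two models under three prompt paraphrases.
example_scores = {
    "model_a": {"p1": 0.70, "p2": 0.50, "p3": 0.60},
    "model_b": {"p1": 0.62, "p2": 0.64, "p3": 0.63},
}

summary = aggregate_over_prompts(example_scores)
rank_p1 = rank_by_prompt(example_scores, "p1")  # model_a ranks first
rank_p2 = rank_by_prompt(example_scores, "p2")  # model_b ranks first
rank_mean = sorted(summary, key=lambda m: summary[m]["mean"], reverse=True)
```

In this toy data, the winner flips between prompts `p1` and `p2`, while the mean over prompts favors `model_b` and the max favors `model_a`. Which aggregate is appropriate depends on the use case, for example comparing models in general versus choosing the best prompt for one downstream task.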

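By comprehensively analyzing the brittleness of single-prompt evaluation results across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks, we propose evaluating LLMs with a set of diverse prompts and design tailored evaluation metrics for specific use cases (e.g., LLM developers vs. developers interested in a specific downstream task), making assessment of current LLMs more accurate and reliable. We also implement these criteria and evaluate multiple models, providing insights into the true strengths and limitations of current LLMs.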

State of What Art? A Call for Multi-Prompt LLM Evaluation

The advent of large language models (LLMs) and their adoption by the legal
community has given rise to the question: what types of legal reasoning can
LLMs perform? To enable greater study of this question, we present LegalBench:
a collaboratively constructed legal reasoning benchmark consisting of 162 tasks
covering six different types of legal reasoning. LegalBench was built through
an interdisciplinary process, in which we collected tasks designed and
hand-crafted by legal professionals. Because these subject matter experts took
a leading role in construction, tasks either measure legal reasoning
capabilities that are practically useful, or measure reasoning skills that
lawyers find interesting. To enable cross-disciplinary conversations about LLMs
in the law, we additionally show how popular legal frameworks for describing
legal reasoning -- which distinguish between its many forms -- correspond to
LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary.
This paper describes LegalBench, presents an empirical evaluation of 20
open-source and commercial LLMs, and illustrates the types of research
explorations LegalBench enables.

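This paper introduces LegalBench, presents an empirical evaluation of 20 open-source and commercial large language models, and illustrates the types of research explorations LegalBench enables.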