The advancement of large language models (LLMs) has enhanced their ability to
generalize across a wide range of unseen natural language processing (NLP)
tasks through instruction-following. Yet, their effectiveness often diminishes
in low-resource languages like Chinese, exacerbated by biased evaluations from
data leakage, casting doubt on their true generalizability to new linguistic
territories. In response, we introduce the Chinese Instruction-Following
Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of
LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000
input-output pairs, developed by native speakers to test complex reasoning and
Chinese cultural nuances across 20 categories. To mitigate evaluation bias, we
release only half of the dataset publicly, with the remainder kept private, and
introduce diversified instructions to minimize score variance, totaling 45,000
data instances. Our evaluation of 28 selected LLMs reveals a noticeable
performance gap, with the best model scoring only 52.9%, highlighting the
limitations of LLMs in less familiar language and task contexts. This work aims
to uncover the current limitations of LLMs in handling Chinese tasks, pushing
towards the development of more culturally informed and linguistically diverse
models with the released data and benchmark.
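
The abstract says diversified instructions are used to minimize score variance but does not say how scores are combined. As a minimal sketch, assuming a simple average over instruction phrasings, the idea could look like the following. The function name, the instruction keys, and the scores are hypothetical and are not taken from CIF-Bench.

```python
from statistics import mean, pstdev


def aggregate_task_score(scores_by_instruction):
    """Combine one model's scores on one task across instruction phrasings.

    Returns the mean score and the population standard deviation, so that
    sensitivity to the wording of the instruction is visible next to the score.
    Averaging is an assumption made for illustration only.
    """
    values = list(scores_by_instruction.values())
    return mean(values), pstdev(values)


# Hypothetical scores for one model on one task under three phrasings.
scores = {"instruction_a": 0.50, "instruction_b": 0.60, "instruction_c": 0.40}
task_mean, task_spread = aggregate_task_score(scores)
print(f"mean={task_mean:.3f} spread={task_spread:.3f}")
```

Reporting the spread alongside the mean makes it clear when a score depends heavily on how the instruction happens to be worded.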

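LLMs are limited in handling Chinese tasks. This study introduces the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the zero-shot generalizability of LLMs to the Chinese language, and it reveals problems of evaluation bias and a performance gap.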

CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

As large language models (LLMs) continue to advance, accurately and
comprehensively evaluating their performance becomes increasingly challenging.
Conventionally, human evaluations are considered the gold standard in natural
language generation. Recent advancements incorporate state-of-the-art LLMs as
proxies for human judges in evaluation processes. Nonetheless, the extent to
which humans and LLMs are capable evaluators remains uncertain. This study aims
to investigate the behavior of both crowd-sourced human and LLM-based judges
when comparing outputs from different models. To accomplish this, we curate a
dataset comprising intentionally flawed machine-generated answers. Our findings
indicate that despite the potentially greater danger posed by factual errors,
answers with factual errors were still rated more favorably compared to answers
that were too short or contained grammatical errors. This highlights a
concerning bias in the evaluation process. To address this issue, we propose to
independently evaluate machine-generated text across multiple dimensions,
rather than merging all the evaluation aspects into a single score. We
instantiate this idea with the Elo rating system, resulting in the Multi-Elo
Rating System. Empirical results from our study reveal that this proposed
approach significantly enhances the quality of LLM-based evaluations,
particularly in terms of factual accuracy. However, notable improvement is not
observed in crowd-sourced evaluations, suggesting the need for further
investigation and refinement.
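
The idea of keeping one Elo rating per evaluation dimension, rather than a single merged score, can be sketched as follows. This is a minimal illustration using the standard Elo update, not the authors' implementation. The K-factor, the initial rating, the dimension names, and the model names are assumptions chosen for the example.

```python
from collections import defaultdict

K_FACTOR = 32.0          # assumed update step size
INITIAL_RATING = 1000.0  # assumed starting rating for every model


def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, model_a, model_b, outcome_a, k=K_FACTOR):
    """Apply one Elo update; outcome_a is 1.0 (A wins), 0.5 (tie), or 0.0."""
    r_a, r_b = ratings[model_a], ratings[model_b]
    e_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (outcome_a - e_a)
    ratings[model_b] = r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))


def multi_elo(comparisons, dimensions):
    """Keep an independent rating table for each evaluation dimension."""
    tables = {d: defaultdict(lambda: INITIAL_RATING) for d in dimensions}
    for model_a, model_b, outcomes in comparisons:
        for d in dimensions:
            update(tables[d], model_a, model_b, outcomes[d])
    return tables


# Hypothetical judgment: model_x wins on accuracy but loses on language.
comparisons = [("model_x", "model_y", {"accuracy": 1.0, "language": 0.0})]
tables = multi_elo(comparisons, ["accuracy", "language"])
for dimension, ratings in tables.items():
    print(dimension, dict(ratings))
```

Because each dimension has its own table, a fluent answer with factual errors can gain rating on language while losing it on accuracy, instead of the two effects cancelling inside one number.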

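Using large language models (LLMs) as substitutes for human judges is a recent trend in evaluating natural language generation. However, this study finds that the resulting evaluations are biased. To address this, it proposes the Multi-Elo Rating System, which evaluates multiple dimensions independently. The system significantly improves the quality of LLM-based evaluations but brings no clear improvement to crowd-sourced evaluations, which calls for further investigation and refinement.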

Style Over Substance: Evaluation Biases for Large Language Models

We uncover a systematic bias in the evaluation paradigm of adopting large
language models (LLMs), e.g., GPT-4, as a referee to score the quality of
responses generated by candidate models. We find that the quality ranking of
candidate responses can be easily hacked by simply altering their order of
appearance in the context. This manipulation allows us to skew the evaluation
result, making one model appear considerably superior to the other, e.g.,
Vicuna could beat ChatGPT on 66 of 80 tested queries. To address this issue,
we propose two simple yet effective calibration strategies: 1) Multiple
Evidence Calibration, which requires the evaluator model to generate multiple
detailed pieces of evidence before assigning ratings; 2) Balanced Position
Calibration, which aggregates results across various orders to determine the
final score. Extensive experiments demonstrate that our approach successfully
mitigates evaluation bias, resulting in closer alignment with human judgments.
To facilitate future research on more robust large language model comparison,
we integrate the techniques in the paper into an easy-to-use toolkit,
FairEval, along with the human annotations
(https://github.com/i-Eval/FairEval).
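
Balanced Position Calibration, as described above, aggregates results across response orders. A minimal sketch, assuming two candidate responses and a simple average over both orders, is given below. It does not use the FairEval toolkit's API. The `judge` callable and the `toy_judge` stand-in are hypothetical, and a real setup would call an LLM evaluator instead.

```python
def balanced_position_scores(judge, question, response_1, response_2):
    """Score two responses in both orders and average per response.

    `judge(question, first, second)` is a hypothetical callable that returns
    (score_of_first, score_of_second) for the order in which it was shown
    the responses.
    """
    s1_as_first, s2_as_second = judge(question, response_1, response_2)
    s2_as_first, s1_as_second = judge(question, response_2, response_1)
    score_1 = (s1_as_first + s1_as_second) / 2.0
    score_2 = (s2_as_first + s2_as_second) / 2.0
    return score_1, score_2


def toy_judge(question, first, second):
    """Stand-in evaluator with a built-in bias toward the first position.

    Quality is faked as response length; the first response gets +3.0.
    """
    return len(first) + 3.0, float(len(second))


# In a single order the biased judge prefers "abcd" (7.0 vs 5.0) ...
print(toy_judge("q", "abcd", "abcde"))
# ... but after averaging both orders the position bonus cancels out.
score_1, score_2 = balanced_position_scores(toy_judge, "q", "abcd", "abcde")
print(score_1, score_2)
```

With a constant position bonus, averaging over the two orders adds the same amount to both responses, so the comparison depends only on the underlying quality scores.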

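This paper uncovers a systematic bias in the evaluation paradigm that uses large language models (LLMs) as referees to score the quality of responses generated by candidate models. The authors propose two calibration strategies to address the problem. Extensive experiments show that the approach mitigates evaluation bias and aligns more closely with human judgments. To facilitate future research on more robust LLM comparison, the authors integrate the paper's techniques into an easy-to-use toolkit, FairEval, released together with the human annotations.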