The evaluation of Large Language Models (LLMs) is a key element in their
continuous improvement process, and many benchmarks have been developed to
assess the performance of LLMs on different tasks and topics. As LLMs are
adopted worldwide, evaluating them in languages other than English is
increasingly important. However, most LLM benchmarks are simply translated
using an automated tool and then run in the target language. This means that
the results depend not only on the LLM performance in that language but also on
the quality of the translation. In this paper, we consider the case of the
well-known Massive Multitask Language Understanding (MMLU) benchmark. Selected
categories of the benchmark are translated into Spanish using Azure Translator
and ChatGPT4 and run on ChatGPT4. Next, the results are processed to identify
the test items that produce different answers in Spanish and English. Those are
then analyzed manually to determine whether the automatic translation caused the
change. The results show that a significant fraction of the failing items can
be attributed to mistakes in the translation of the benchmark. These results
make a strong case for improving benchmarks in languages other than English by
at least revising the translations of the items and, preferably, by having
experts adapt the tests to the target language.
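
The comparison step described above, finding the items whose answer changes between English and Spanish, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' actual pipeline: the `ItemResult` record, its field names, and both helper functions are hypothetical, and the sample answers are made up purely to exercise the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemResult:
    """One benchmark item answered in both languages (hypothetical record)."""
    item_id: str
    gold: str        # correct option letter, e.g. "B"
    answer_en: str   # model's answer on the original English item
    answer_es: str   # model's answer on the machine-translated Spanish item


def divergent_items(results):
    """Items whose answer differs between English and Spanish."""
    return [r for r in results if r.answer_en != r.answer_es]


def spanish_only_failures(results):
    """Items answered correctly in English but incorrectly in Spanish.

    Assumed here to be the natural candidates for manual translation review.
    """
    return [r for r in results
            if r.answer_en == r.gold and r.answer_es != r.gold]


# Made-up answers, for illustration only.
sample = [
    ItemResult("q1", gold="A", answer_en="A", answer_es="A"),
    ItemResult("q2", gold="C", answer_en="C", answer_es="B"),
    ItemResult("q3", gold="D", answer_en="B", answer_es="D"),
]

print([r.item_id for r in divergent_items(sample)])        # ['q2', 'q3']
print([r.item_id for r in spanish_only_failures(sample)])  # ['q2']
```

Neither list says why an answer changed. Per the abstract, that judgment comes from manual analysis of each flagged item.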

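Evaluating how well large language models perform in other languages, correcting translation errors, and adapting test items to the target language are key to improving benchmarks for languages other than English.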

Spanish and LLM Benchmarks: is MMLU Lost in Translation?

The rapid rise in popularity of Large Language Models (LLMs) with emerging
capabilities has spurred public interest in evaluating and comparing different
LLMs, leading many researchers to propose their own LLM benchmarks. Noticing
preliminary inadequacies in those benchmarks, we embarked on a study to
critically assess 23 state-of-the-art LLM benchmarks, using our novel unified
evaluation framework through the lenses of people, process, and technology,
under the pillars of functionality and security. In one comprehensive
assessment, our research uncovered significant limitations, including biases,
difficulties in measuring genuine reasoning, adaptability, implementation
inconsistencies, prompt engineering complexity, evaluator diversity, and the
overlooking of cultural and ideological norms. Our discussions emphasized
the urgent need for standardized methodologies, regulatory certainties, and
ethical guidelines in light of Artificial Intelligence (AI) advancements,
including advocating for an evolution from static benchmarks to dynamic
behavioral profiling to accurately capture LLMs' complex behaviors and
potential risks. Our study highlighted the necessity for a paradigm shift in
LLM evaluation methodologies, underlining the importance of collaborative
efforts for the development of universally accepted benchmarks and the
enhancement of AI systems' integration into society.

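Using our unified evaluation framework, with people, process, and technology as lenses and functionality and security as pillars, the study examined 23 state-of-the-art LLM benchmarks and found significant limitations. It stresses the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines as artificial intelligence advances, and the importance of collaborative efforts to develop widely accepted benchmarks and improve the integration of AI systems into society.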