We propose KMMLU, a new Korean benchmark with 35,030 expert-level
multiple-choice questions across 45 subjects ranging from humanities to STEM.
Unlike previous Korean benchmarks that are translated from existing English
benchmarks, KMMLU is collected from original Korean exams, capturing linguistic
and cultural aspects of the Korean language. We test 26 publicly available
and proprietary LLMs, identifying significant room for improvement. The best
publicly available model achieves 50.54% on KMMLU, far below the average human
performance of 62.6%. This model was primarily trained for English and Chinese,
not Korean. Current LLMs tailored to Korean, such as Polyglot-Ko, perform far
worse. Surprisingly, even the most capable proprietary LLMs, e.g., GPT-4 and
HyperCLOVA X, achieve only 59.95% and 53.40%, respectively. This suggests that
further work is needed to improve Korean LLMs, and KMMLU offers the right tool
to track this progress. We make our dataset publicly available on the Hugging
Face Hub and integrate the benchmark into EleutherAI's Language Model
Evaluation Harness.
