Large language models (LLMs) have demonstrated remarkable performance in the legal domain, with GPT-4 even passing the Uniform Bar Exam in the U.S. However, their efficacy remains limited for non-standardized tasks and for tasks in languages other than English. This underscores the need to evaluate LLMs carefully within each legal system before applying them. Here, we introduce KBL, a benchmark for assessing the Korean legal language understanding of LLMs, consisting of (1) 7 legal knowledge tasks (510 examples), (2) 4 legal reasoning tasks (288 examples), and (3) the Korean bar exam (4 domains, 53 tasks, 2,510 examples). The first two datasets were developed in close collaboration with lawyers to evaluate LLMs in practical scenarios in an expert-verified manner. Furthermore, because legal practitioners frequently consult extensive legal documents for research, we assess LLMs in both a closed-book setting, where they rely solely on internal knowledge, and a retrieval-augmented generation (RAG) setting that uses a corpus of Korean statutes and precedents. The results indicate substantial room for improvement.

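This paper addresses the insufficient evaluation of current large language models on non-standardized tasks and on tasks in languages other than English. We propose the KBL benchmark, which specifically evaluates large language models' understanding of Korean legal language and comprises multiple legal knowledge and reasoning tasks as well as examples from the Korean bar exam. The results show that current models still have substantial room for improvement in legal language understanding, underscoring the need for further optimization.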

Developing a practical benchmark for evaluating the Korean legal language understanding of large language models