We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
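
As a rough illustration of the FIM task described above, the following minimal sketch masks a conditional expression in a small program and builds both an FIM prompt and a plain L2R prompt from it. The sentinel strings, the prefix-suffix-middle ordering, and the helper names (`split_at_span`, `build_psm_prompt`, `build_l2r_prompt`) are assumptions made for illustration; they are not taken from the SAFIM toolkit, and real models define their own special tokens.

```python
# Hypothetical sentinel strings; each FIM-trained model defines its own tokens.
PREFIX_TOKEN = "<PREFIX>"
SUFFIX_TOKEN = "<SUFFIX>"
MIDDLE_TOKEN = "<MIDDLE>"


def split_at_span(source: str, start: int, end: int) -> tuple[str, str, str]:
    """Split source into (prefix, middle, suffix) around the masked span."""
    return source[:start], source[start:end], source[end:]


def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle layout: the model generates the middle last."""
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"


def build_l2r_prompt(prefix: str) -> str:
    """Left-to-right baseline: the model sees only the code before the gap."""
    return prefix


# Mask the conditional expression of the `if` statement.
source = "def sign(x):\n    if x < 0:\n        return -1\n    return 1\n"
target = "x < 0"
start = source.index(target)
prefix, middle, suffix = split_at_span(source, start, start + len(target))

fim_prompt = build_psm_prompt(prefix, suffix)
l2r_prompt = build_l2r_prompt(prefix)
```

In this sketch the FIM prompt exposes the code on both sides of the gap, while the L2R prompt drops the suffix entirely, which is the difference between the two inference modes that the evaluation compares.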

Evaluating LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks