We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) in generating complete specification code pertaining to operating system kernel verification tasks. The benchmark first defines the specification generation problem into a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs are required to understand the provided verification assumption and the potential syntax and semantics space to search for, then generate the complete specification for the potentially buggy operating system code implementation under the guidance of the high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each is a long context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs exhibits the limited performance of the current LLMs on the specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.

本研究提出了OSVBench，一个用于评估大型语言模型在生成与操作系统内核验证相关的完整规范代码的新基准。研究展示了当前大型语言模型在操作系统验证的规范生成任务中表现有限，揭示了它们在处理长上下文代码生成任务时的能力差异，从而为未来的研究提供了重要的改进方向。

OSVBench：操作系统验证中的规范生成任务对大型语言模型的基准测试