Large language models (LLMs) play a crucial role in software engineering, excelling at tasks such as code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a single task and lacking a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting each model's strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.

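This work addresses the lack of a comprehensive evaluation framework among existing software engineering benchmarks. The paper proposes CoCo-Bench, which evaluates large language models along four dimensions (code understanding, generation, modification, and review) and covers multiple programming languages and task difficulties. The results show that CoCo-Bench reveals significant differences in model performance, providing a reliable benchmark for future research on code-oriented large language models.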

CoCo-Bench: A Comprehensive Code Benchmark for Multi-task Large Language Model Evaluation