The ability of CodeLLMs to generate executable and functionally correct code at the \textit{repository-level scale} remains largely unexplored. We introduce \methodnamews, a novel benchmark for evaluating code generation at the repository-level scale, with an emphasis on executability and correctness. \methodnamews provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. We study a controlled scenario in which developers specify the necessary code dependencies and the model must integrate them accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel at utilizing the provided dependencies and demonstrate debugging capabilities. \methodnamews aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.
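
The two properties the abstract emphasizes, functional correctness under test cases and use of developer-specified dependencies, can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the helper names (`called_names`, `dependency_usage`, `passes_tests`) and the toy example are assumptions made for illustration, and a real harness would run generated code in an isolated sandbox rather than calling `exec` in-process.

```python
import ast


def called_names(source: str) -> set:
    """Collect the names of everything invoked via a call expression."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)      # plain call: foo(...)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)    # attribute call: obj.foo(...)
    return names


def dependency_usage(generated: str, dependencies: list) -> float:
    """Fraction of the provided dependencies that the generated code calls.

    Hypothetical metric for illustration; matching is by name only.
    """
    if not dependencies:
        return 1.0
    used = called_names(generated)
    return sum(dep in used for dep in dependencies) / len(dependencies)


def passes_tests(context: str, generated: str, tests: str) -> bool:
    """Run context + generated code + assert-style tests in one namespace.

    Returns False on any exception, including syntax errors and failed
    assertions. Assumption: the code is trusted enough to exec in-process.
    """
    namespace = {}
    try:
        exec(context + "\n" + generated + "\n" + tests, namespace)
    except Exception:
        return False
    return True


# Toy repository context: one dependency the developer asks the model to use.
context = "def normalize(s):\n    return s.strip().lower()\n"
# Candidate completion produced by a model (hypothetical).
generated = "def same_word(a, b):\n    return normalize(a) == normalize(b)\n"
# Assert-style test cases for the target function.
tests = "assert same_word(' Hi', 'hi ')\nassert not same_word('a', 'b')\n"

is_correct = passes_tests(context, generated, tests)
usage = dependency_usage(generated, ["normalize"])
```

Keeping the two checks separate reflects the distinction the abstract draws: a completion can pass its tests while ignoring the dependencies it was given, or call them all and still be incorrect.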

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark