Recent development of large language models (LLMs) for code like CodeX and CodeT5+ demonstrates tremendous promise in achieving code intelligence. Their ability of synthesizing code that completes a program for performing a pre-defined task has been intensively tested and verified on benchmark datasets including HumanEval and MBPP. Yet, evaluation of these LLMs from more perspectives (than just program synthesis) is also anticipated, considering their broad scope of applications in software engineering. In this paper, we explore the ability of LLMs for testing programs/code. By performing thorough analyses of recent LLMs for code in program testing, we show a series of intriguing properties of these models and demonstrate how program testing ability of LLMs can be improved. Following recent work which utilizes generated test cases to enhance program synthesis, we further leverage our findings in improving the quality of the synthesized programs and show +11.77% and +4.22% higher code pass rates on HumanEval+ comparing with the GPT-3.5-turbo baseline and the recent state-of-the-art, respectively.

利用对最近的大型语言模型进行了代码测试的详尽分析，本研究展示了这些模型的一系列有趣性质，并展示了如何改进大型语言模型的程序测试能力，通过利用生成的测试用例来提高合成程序的质量，相较于GPT-3.5-turbo和最新的最先进技术，我们的方法在HumanEval+上的代码通过率分别提高了11.77%和4.22%。

用于代码的大型语言模型的程序测试能力