Recently, a diverse set of decoding and reranking procedures has been shown to be effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution (DOCE), a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as its core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has often been overlooked in prior work. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.
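To make two of the components named above concrete, the following is a minimal sketch of filtering candidates on trial unit tests and then applying an execution-based MBR selection. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`run_candidate`, `passes_trial_tests`, `mbr_exec`, `select`), the exact-match utility over execution outputs, and the toy candidates are all hypothetical, and a real system would run generated code in a sandbox rather than with a bare `exec`.

```python
from collections import Counter


def run_candidate(src, func_name, args):
    """Run one candidate program on one input and return a hashable outcome.

    Assumption: each candidate is Python source defining `func_name`.
    A bare exec() is unsafe for untrusted code; use a sandbox in practice.
    """
    env = {}
    try:
        exec(src, env)
        return ("ok", repr(env[func_name](*args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def passes_trial_tests(src, func_name, trial_tests):
    """True if the candidate reproduces every (args, expected) trial test."""
    return all(
        run_candidate(src, func_name, args) == ("ok", repr(expected))
        for args, expected in trial_tests
    )


def mbr_exec(candidates, func_name, probe_inputs):
    """Pick the candidate whose execution outputs agree with the most others.

    Assumed utility: 1 if two candidates produce identical outputs on all
    probe inputs, else 0. Ties go to the earliest candidate.
    """
    signatures = [
        tuple(run_candidate(c, func_name, a) for a in probe_inputs)
        for c in candidates
    ]
    counts = Counter(signatures)
    best = max(range(len(candidates)), key=lambda i: counts[signatures[i]])
    return candidates[best]


def select(candidates, func_name, trial_tests, probe_inputs):
    """Filter on trial unit tests, then apply execution-based MBR."""
    survivors = [
        c for c in candidates if passes_trial_tests(c, func_name, trial_tests)
    ]
    # Fall back to the full pool if the filter removes every candidate.
    return mbr_exec(survivors or candidates, func_name, probe_inputs)


# Toy candidates for a hypothetical "add two numbers" task.
candidates = [
    "def add(a, b):\n    return a - b",  # fails the trial test
    "def add(a, b):\n    return a * b",  # passes the trial test by coincidence
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return b + a",
]
trial_tests = [((2, 2), 4)]       # (args, expected) pairs
probe_inputs = [(1, 2), (3, 5)]   # inputs without known expected outputs

chosen = select(candidates, "add", trial_tests, probe_inputs)
print(chosen)
```

In this toy setup the trial test alone cannot separate multiplication from addition, since both return 4 on `(2, 2)`; the agreement step over the probe inputs is what breaks the tie in favor of the two equivalent addition candidates.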

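This work addresses the lack of a comprehensive framework for comparing existing LLM-based code generation methods by proposing a framework that comprises candidate generation, $n$-best reranking, minimum Bayes risk decoding, and self-debugging. The results highlight the importance of execution-based methods and show how filtering on unit tests, a simple and effective strategy, affects code generation performance.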

DOCE: Finding the Sweet Spot for Execution-Based Code Generation