Code verification has recently become a critical component in training large-scale reasoning models for coding. Synthetic techniques such as self-generated test cases and reward models offer a way to improve coding capabilities beyond what predefined tests allow. Building on these advances, we propose new benchmarks designed to systematically evaluate how well synthetic verification methods assess solution correctness. We introduce HE-R, HE-R+, MBPP-R, and MBPP-R+, which transform existing coding benchmarks into scoring and ranking datasets for evaluating synthetic verifiers. Using these benchmarks, we analyze synthetic verification methods in standard, reasoning-based, and reward-based LLMs. Our results show that recent reasoning models significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
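To make the idea of a scoring and ranking dataset concrete, the following is a minimal sketch, not the benchmarks' actual construction. It assumes that each candidate solution is scored by the fraction of test cases it passes, and that a synthetic verifier is judged by whether its ranking of candidates agrees with the ranking induced by reference tests. The toy problem, the candidate functions, and the names `pass_fraction` and `rank_candidates` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

TestCase = Tuple[int, int]  # (input, expected output)

# Hypothetical candidates for the toy task "return the absolute value of x".
def sol_correct(x: int) -> int:
    return x if x >= 0 else -x

def sol_partial(x: int) -> int:
    return x  # wrong for negative inputs

def sol_wrong(x: int) -> int:
    return -1  # never correct

def pass_fraction(solution: Callable[[int], int], tests: List[TestCase]) -> float:
    """Score a solution as the fraction of test cases it passes."""
    passed = 0
    for arg, expected in tests:
        try:
            if solution(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crashing solution simply fails that test
    return passed / len(tests)

def rank_candidates(
    candidates: Dict[str, Callable[[int], int]], tests: List[TestCase]
) -> List[str]:
    """Order candidate names from highest to lowest score."""
    scores = {name: pass_fraction(fn, tests) for name, fn in candidates.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)

candidates = {"correct": sol_correct, "partial": sol_partial, "wrong": sol_wrong}

# Reference tests stand in for a benchmark's predefined tests (assumed here).
reference_tests: List[TestCase] = [(3, 3), (-2, 2), (0, 0), (-7, 7)]
# Synthetic tests stand in for test cases a model might generate (assumed here).
synthetic_tests: List[TestCase] = [(1, 1), (-1, 1)]

reference_ranking = rank_candidates(candidates, reference_tests)
synthetic_ranking = rank_candidates(candidates, synthetic_tests)

# A simple agreement check: does the verifier pick the same best candidate?
top1_agrees = reference_ranking[0] == synthetic_ranking[0]
```

In this toy setup the two synthetic tests are enough to reproduce the reference ranking. With fewer or less discriminating tests the rankings could diverge, which is the kind of effect a ranking benchmark is meant to expose.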

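This work addresses the shortcomings of current code verification methods in assessing solution correctness and proposes a new suite of benchmarks to systematically evaluate the impact of synthetic verification methods. It finds that modern reasoning models markedly improve test case generation and that scaling up the number of test cases raises verification accuracy, pointing to the substantial potential of synthetic verification for improving coding capabilities.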

Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning