Large language models (LLMs) have significantly advanced mathematical reasoning, and Process Reward Models (PRMs) have been developed to evaluate the logical validity of individual reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) inputs. This paper identifies two key OOD issues: step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shift between training data and real-world problems. To address them, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework built on a two-stage retrieval-enhanced mechanism. RetrievalPRM retrieves semantically similar questions and steps as a warmup, which enhances the PRM's ability to evaluate target steps and improves generalization and reasoning consistency across different models and problem types. Extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
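
To make the two-stage retrieval idea concrete, the following is a minimal sketch and not the released RetrievalPRM implementation. It assumes a small in-memory reference pool (`POOL`), a toy bag-of-words similarity in place of a real semantic encoder, and a step-level search restricted to the steps of the retrieved questions. All names (`embed`, `top_k`, `build_prompt`, `POOL`) and the prompt layout are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, candidates, k, key=lambda c: c):
    """Return the k candidates whose text (via key) is most similar to query."""
    q = embed(query)
    return sorted(candidates, key=lambda c: cosine(q, embed(key(c))), reverse=True)[:k]

# Hypothetical reference pool: questions with step texts and correctness labels.
POOL = [
    {"question": "Solve 2x + 3 = 7 for x",
     "steps": [("Subtract 3 from both sides to get 2x = 4", True),
               ("Divide both sides by 2 to get x = 2", True)]},
    {"question": "Find the area of a circle with radius 3",
     "steps": [("Area is pi times radius squared", True),
               ("So the area is 6 pi", False)]},
    {"question": "Solve 5x - 1 = 9 for x",
     "steps": [("Add 1 to both sides to get 5x = 10", True),
               ("Divide both sides by 5 to get x = 5", False)]},
]

def build_prompt(question, target_step, k_questions=2, k_steps=2):
    # Stage 1: retrieve questions similar to the target question.
    similar_qs = top_k(question, POOL, k_questions, key=lambda e: e["question"])
    # Stage 2 (assumption): retrieve similar steps only from those questions.
    step_pool = [s for e in similar_qs for s in e["steps"]]
    similar_steps = top_k(target_step, step_pool, k_steps, key=lambda s: s[0])
    # Place the retrieved references ahead of the target as a warmup.
    lines = ["Reference questions:"]
    lines += [f"- {e['question']}" for e in similar_qs]
    lines.append("Reference steps:")
    lines += [f"- {text} [{'correct' if ok else 'incorrect'}]" for text, ok in similar_steps]
    lines += [f"Question: {question}", f"Step to evaluate: {target_step}"]
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_prompt("Solve 3x + 2 = 11 for x",
                       "Subtract 2 from both sides to get 3x = 9"))
```

In this sketch the retrieved references are simply prepended to the text the PRM would score. A real system would likely swap `embed` for a learned embedding model and draw the pool from the training data.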

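This study examines the specific problems that Process Reward Models (PRMs) face under out-of-distribution (OOD) conditions, namely OOD reasoning steps and OOD questions. It proposes a novel Retrieval-Augmented Process Reward Model (RetrievalPRM), which uses a two-stage retrieval mechanism to improve the generalization and reasoning consistency of PRMs. Experimental results show that the model performs strongly on multiple real-world datasets and raises the performance standard for PRMs.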

Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning