BriefGPT.xyz
Jul, 2024
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue...
TL;DR
The hybrid alignment framework (HaF-RM) trains the reward model by introducing an additional constraint on token-level policy probabilities, simultaneously supervising the internal preference model at the token level and optimizing the reward model's mapping layer. By decoupling the reward modeling process and combining hybrid supervision, the HaF-RM framework provides a principled and effective approach to improving the performance and alignment of reward models.
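The hybrid supervision described above can be sketched as a combined loss: a sequence-level preference loss on the reward head plus a token-level constraint on policy log-probabilities. The function below is a minimal illustration assuming a Bradley-Terry reward loss and a DPO-style policy term against a frozen reference model; the names, shapes, and weighting are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a hybrid alignment objective in the spirit of HaF-RM.
# All argument names and the exact combination are illustrative assumptions.

def hybrid_loss(r_chosen, r_rejected,
                logp_chosen, logp_rejected,
                ref_logp_chosen, ref_logp_rejected,
                beta=0.1, alpha=0.5):
    # Sequence-level Bradley-Terry loss on the reward head's scalar outputs.
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Token-level policy constraint: a DPO-style margin on summed policy
    # log-probabilities, measured against a frozen reference model.
    policy_margin = beta * ((logp_chosen - ref_logp_chosen)
                            - (logp_rejected - ref_logp_rejected))
    policy_loss = -F.logsigmoid(policy_margin).mean()
    # Hybrid objective: jointly supervise the reward mapping and the policy.
    return reward_loss + alpha * policy_loss
```

In this sketch, `alpha` trades off the two supervision signals; a larger `alpha` puts more weight on the token-level policy constraint relative to the sequence-level reward loss.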
Abstract
The reward model has become increasingly important in alignment, assessment, and data construction for large …