Recent advancements in large language models (LLMs) have led to their increased application across various tasks, with reinforcement learning from human feedback (RLHF) being a crucial part of their training to align responses with user intentions. In the RLHF process, a reward model is trained on response preferences determined by human labelers or AI systems, and is then used to refine the LLM through reinforcement learning. This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance. Weak supervision employs noisy or imprecise data labeling, reducing reliance on expensive manually labeled data. By analyzing RLHF datasets to identify heuristics that correlate with response preference, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data. Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets. Additionally, using an LLM to generate and then weakly label responses offers a promising method for extending preference data.
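
To make the labeling-function idea concrete, the following is a minimal sketch of weakly labeling a response pair. It is an illustration and not the method used in this work. The two heuristics (`lf_length`, `lf_refusal`), the label convention, and the fixed `weights` are hypothetical; a weighted vote stands in for a calibrated label model, which would learn those weights from data.

```python
from typing import Callable, Sequence

# Label convention (an assumption for this sketch):
# 1 = response A preferred, 0 = response B preferred, -1 = abstain.
A_PREFERRED, B_PREFERRED, ABSTAIN = 1, 0, -1

LabelingFunction = Callable[[str, str], int]


def lf_length(a: str, b: str, min_gap: int = 20) -> int:
    """Hypothetical heuristic: prefer the clearly longer response."""
    gap = len(a) - len(b)
    if abs(gap) < min_gap:
        return ABSTAIN
    return A_PREFERRED if gap > 0 else B_PREFERRED


def lf_refusal(a: str, b: str) -> int:
    """Hypothetical heuristic: prefer the response that does not refuse."""
    markers = ("i cannot", "i can't", "i'm sorry")
    a_refuses = any(m in a.lower() for m in markers)
    b_refuses = any(m in b.lower() for m in markers)
    if a_refuses == b_refuses:
        return ABSTAIN
    return B_PREFERRED if a_refuses else A_PREFERRED


def weak_label(
    a: str,
    b: str,
    lfs: Sequence[LabelingFunction],
    weights: Sequence[float],
) -> int:
    """Combine labeling-function votes with a weighted vote.

    This is a simple stand-in for a label model: the weights are given
    by hand here rather than estimated from the votes.
    """
    score = 0.0
    for lf, weight in zip(lfs, weights):
        vote = lf(a, b)
        if vote == A_PREFERRED:
            score += weight
        elif vote == B_PREFERRED:
            score -= weight
    if score == 0:
        return ABSTAIN
    return A_PREFERRED if score > 0 else B_PREFERRED
```

Pairs for which every labeling function abstains, or for which the votes cancel out, stay unlabeled and would be left out of the extended dataset.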

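This work addresses the heavy reliance on manually labeled data in reward model training. By introducing weak supervision, which uses noisy or imprecise labels, the researchers extend RLHF datasets and improve reward model performance. The results show that weak supervision significantly improves reward model performance on small datasets, but that the benefit diminishes on large datasets. Using an LLM to generate responses and then weakly labeling them also shows potential for extending preference data.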

Reward Modeling for Language Models Using Weak Supervision