BriefGPT.xyz
Jun, 2024
Prototypical Reward Network for Data-Efficient RLHF
Jinghan Zhang, Xiting Wang, Yiqiao Jin, Changyu Chen, Xinhao Zhang...
TL;DR
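The Proto-RM framework strengthens reward models under limited human feedback and improves the fine-tuning of language models. It significantly improves adaptability and accuracy, and in data-constrained scenarios it requires less data than traditional methods.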
Abstract
The reward model for reinforcement learning from human feedback (RLHF) has proven effective in fine-tuning large language models (LLMs). N