BriefGPT.xyz
Oct, 2024
Dual Active Learning for Reinforcement Learning from Human Feedback
Pangpang Liu, Chengchun Shi, Will Wei Sun
TL;DR
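This study addresses the efficiency of learning reward functions from human feedback. It proposes a dual active reward learning algorithm that selects conversations and teachers simultaneously to improve data quality. Using pessimistic reinforcement learning and an adaptive selection strategy, the authors prove that the resulting reward estimator has minimal generalized variance, and simulation experiments show that the algorithm outperforms existing methods.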
Abstract
Aligning Large Language Models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement Learning from …