Jul, 2023
Kernelized Offline Contextual Dueling Bandits
Viraj Mehta, Ojash Neopane, Vikramjeet Das, Sen Lin, Jeff Schneider...
TL;DR
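In this work, we take advantage of the fact that the agent can choose the contexts at which it obtains human feedback, and we introduce the offline contextual dueling bandit setting. We propose an upper-confidence-bound style algorithm for this setting and prove a regret bound. Experiments confirm that the method outperforms a similar strategy that uses uniformly sampled contexts.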
Abstract
Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large la