Feb, 2012
Contextual Bandit Learning with Predictable Rewards
Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, Robert E. Schapire
TL;DR
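This work studies contextual bandit learning under a realizability assumption and proposes a new algorithm, Regressor Elimination. The authors prove that, under realizability, its regret is similar to the regret achievable in the agnostic setting (that is, without the realizability assumption). They also show that, for any set of policies, there is a reward distribution under which the algorithm attains constant regret, in contrast to previous approaches.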
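The two terms this summary relies on, realizability and regret, can be written out in standard notation. The symbols below ($\mathcal{F}$ for the regressor class, $f^\star$ for the true regressor, $x_t$, $a_t$, $r_t$ for the context, chosen action and reward in round $t$) are assumptions chosen for illustration and do not appear on this page.

```latex
% Realizability: the expected reward is given by some regressor in the known class F.
\[
  \exists\, f^\star \in \mathcal{F} \quad \text{such that} \quad
  \mathbb{E}\bigl[\, r_t(a) \mid x_t \,\bigr] = f^\star(x_t, a)
  \quad \text{for every action } a .
\]

% Regret after T rounds, measured against the policy induced by f^\star.
\[
  \mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \Bigl( r_t\bigl(\pi^\star(x_t)\bigr) - r_t(a_t) \Bigr),
  \qquad
  \pi^\star(x) \;=\; \arg\max_{a} f^\star(x, a) .
\]
```

Under this reading, "constant regret" means that $\mathrm{Regret}_T$ stays bounded by a quantity that does not grow with $T$.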
Abstract
Contextual bandit learning is a reinforcement learning problem where the learner repeatedly receives a set of features (context), takes an action, and receives a reward based on the action and context. We consider …
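The interaction protocol described in the abstract (observe a context, pick an action, receive a reward for that action only) can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the paper's Regressor Elimination algorithm: the learner is a plain epsilon-greedy rule over a tabular reward estimate, and the environment, the `mean_reward` table, and the function name `run_contextual_bandit` are all hypothetical. The table makes the problem realizable in the simplest sense, since the expected reward is exactly a function of (context, action) that the learner's tabular estimates can represent.

```python
import random


def run_contextual_bandit(num_rounds=5000, epsilon=0.1, seed=0):
    """Simulate the contextual bandit loop with an epsilon-greedy learner.

    Hypothetical setup: 3 discrete contexts, 2 actions, Bernoulli rewards.
    Epsilon-greedy is a stand-in learner, not Regressor Elimination.
    """
    rng = random.Random(seed)

    # Assumed environment: mean_reward[context][action] is the expected reward.
    mean_reward = [[0.2, 0.8], [0.7, 0.3], [0.5, 0.9]]
    num_contexts = len(mean_reward)
    num_actions = len(mean_reward[0])

    # Learner state: running mean of observed rewards per (context, action).
    counts = [[0] * num_actions for _ in range(num_contexts)]
    estimates = [[0.0] * num_actions for _ in range(num_contexts)]

    total_reward = 0.0
    expected_regret = 0.0

    for _ in range(num_rounds):
        # 1. The learner receives a context.
        x = rng.randrange(num_contexts)

        # 2. The learner takes an action: explore with probability epsilon,
        #    otherwise act greedily on the current estimates.
        if rng.random() < epsilon:
            a = rng.randrange(num_actions)
        else:
            a = max(range(num_actions), key=lambda k: estimates[x][k])

        # 3. The learner receives a reward for the chosen action only.
        r = 1.0 if rng.random() < mean_reward[x][a] else 0.0

        # Update the running mean for the (context, action) pair just played.
        counts[x][a] += 1
        estimates[x][a] += (r - estimates[x][a]) / counts[x][a]

        total_reward += r
        # Expected regret against the best action for this context.
        expected_regret += max(mean_reward[x]) - mean_reward[x][a]

    return total_reward, expected_regret, estimates


if __name__ == "__main__":
    reward, regret, _ = run_contextual_bandit()
    print(f"total reward: {reward:.0f}, expected regret: {regret:.1f}")
```

Because epsilon-greedy keeps exploring at a fixed rate, its expected regret in this sketch grows linearly with the number of rounds; that is the kind of behaviour the regret guarantees summarised above are meant to improve on.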