BriefGPT.xyz
Apr, 2017
Time-Sensitive Bayesian Optimization and Multi-Armed Bandit Learning
Time-Sensitive Bandit Learning and Satisficing Thompson Sampling
Daniel Russo, David Tse, Benjamin Van Roy
TL;DR
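This paper studies reinforcement learning under time preference: it replaces cumulative loss with discounted cumulative loss and uses a modified Thompson sampling algorithm to obtain stronger solutions.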
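To make the contrast between cumulative and discounted cumulative loss concrete, here is a minimal sketch. It runs standard Beta-Bernoulli Thompson sampling, not the paper's modified (satisficing) variant, and scores the same run both ways. The arm means, horizon, discount factor, and function names are illustrative assumptions and do not come from the paper.

```python
import random


def discounted_regret(per_step_regret, discount):
    """Return sum over t of discount**t * regret_t, with t starting at 0."""
    return sum((discount ** t) * r for t, r in enumerate(per_step_regret))


def thompson_sampling(true_means, horizon, seed=0):
    """Standard Beta-Bernoulli Thompson sampling (assumed baseline).

    Returns the expected regret incurred at each step, i.e. the gap
    between the best arm's mean and the chosen arm's mean.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) uniform prior on every arm
    failures = [1] * n_arms
    best = max(true_means)
    regrets = []
    for _ in range(horizon):
        # Draw one plausible mean per arm from its posterior, play the argmax.
        samples = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regrets.append(best - true_means[arm])
    return regrets


# Hypothetical three-armed problem, chosen only for illustration.
regrets = thompson_sampling([0.3, 0.5, 0.7], horizon=500)
print("cumulative regret:", sum(regrets))
print("discounted regret:", discounted_regret(regrets, discount=0.95))
```

With a discount factor below 1, regret incurred early in the run dominates the total, which is one way to express a preference for acting well soon rather than only in the limit.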
Abstract
The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientat