Time-Sensitive Bayesian-Optimized Multi-Armed Bandit Learning

Apr, 2017

Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

Daniel Russo, David Tse, Benjamin Van Roy

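TL;DR: This paper studies reinforcement learning under time preference. It replaces cumulative loss with discounted cumulative loss and uses a modified Thompson sampling algorithm to obtain stronger solutions.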
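To make the difference between cumulative loss and discounted cumulative loss concrete, here is a minimal sketch. It is not taken from the paper: the function names, the two loss sequences, and the discount factor 0.5 are illustrative assumptions, and the weight gamma**t with t starting at 0 is one common convention that the paper may define differently.

```python
def cumulative_loss(losses):
    """Undiscounted total: every period counts equally."""
    return sum(losses)


def discounted_loss(losses, gamma):
    """Discounted total: the loss in period t is weighted by gamma**t (t starts at 0).

    The indexing convention is an assumption made for this illustration.
    """
    return sum((gamma ** t) * loss for t, loss in enumerate(losses))


# Hypothetical per-period losses for two learners (made-up numbers).
# Learner A explores heavily for 5 periods, then incurs no further loss.
explore_then_optimal = [1.0] * 5 + [0.0] * 15
# Learner B settles at once for a slightly suboptimal action.
settle_early = [0.3] * 20

gamma = 0.5  # assumed discount factor; smaller values mean stronger time preference

for name, losses in [("explore_then_optimal", explore_then_optimal),
                     ("settle_early", settle_early)]:
    print(name, cumulative_loss(losses), discounted_loss(losses, gamma))
```

On these made-up numbers, the learner that explores first has the lower undiscounted total, while the learner that settles early has the lower discounted total. This is the kind of trade-off that a time-sensitive objective brings out.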

Abstract

The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientat