BriefGPT.xyz
Aug, 2020
Lenient Regret for Multi-Armed Bandits
Nadav Merlis, Shie Mannor
TL;DR
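This paper proposes a bandit algorithm that ignores optimality gaps below a certain threshold and, building on it, designs a variant of Thompson Sampling (ε-TS). The results indicate that the algorithm avoids over-exploration to some extent and improves computational efficiency while preserving performance.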
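To make the idea of ignoring small optimality gaps concrete, here is a minimal Python sketch. It assumes one simple form of the criterion: an arm's gap counts toward the regret only when it exceeds a threshold `eps`. The function names, the `eps` parameter, and this particular thresholding rule are illustrative assumptions and are not taken from the paper, which may define the criterion more generally.

```python
def lenient_regret(means, pulls, eps):
    """Sum the optimality gaps of the pulled arms, skipping gaps of at most eps.

    means: true mean reward of each arm (assumed known, for evaluation only)
    pulls: sequence of arm indices chosen by the agent
    eps:   tolerance (hypothetical name); gaps <= eps are treated as zero
    """
    best = max(means)
    total = 0.0
    for arm in pulls:
        gap = best - means[arm]
        if gap > eps:  # assumption: only gaps strictly above eps are penalized
            total += gap
    return total


def standard_regret(means, pulls):
    """Standard (pseudo-)regret is the special case eps = 0."""
    return lenient_regret(means, pulls, eps=0.0)
```

Under this sketch, an agent that keeps pulling an arm within `eps` of the best one accrues no lenient regret, whereas the standard regret continues to grow with every such pull.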
Abstract
We consider the multi-armed bandit (MAB) problem, where the agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the …