Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

March 2019

Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits

Yingkai Li, Yining Wang, Yuan Zhou

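TL;DR: This paper studies the linear contextual bandit problem with finite action sets, introduces an algorithm called VCL SupLinUCB, and shows that its regret matches the best lower bound, saving two logarithmic factors compared with previous analyses.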
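For a sense of scale, the rate $\sqrt{dT \log T \log n}$ stated in the abstract can be evaluated numerically. The following is a minimal sketch that drops the unspecified constant hidden inside $\Omega(\cdot)$, so the numbers indicate growth only, not actual regret. The function name `minimax_rate` and the example values of $d$, $T$, and $n$ are illustrative assumptions and do not come from the paper.

```python
import math


def minimax_rate(d: int, T: int, n: int) -> float:
    """Return sqrt(d * T * log T * log n), the regret rate with constants dropped.

    Hypothetical helper for illustration only; not part of the paper.
    """
    if d < 1 or T < 2 or n < 2:
        raise ValueError("need d >= 1, T >= 2, n >= 2 so both logarithms are positive")
    if n > 2 ** (d / 2):
        # The abstract states its result for n <= 2^(d/2) candidate actions.
        raise ValueError("the stated regime requires n <= 2^(d/2)")
    return math.sqrt(d * T * math.log(T) * math.log(n))


if __name__ == "__main__":
    d, n = 20, 100  # assumed example values; 100 <= 2^10 = 1024
    for T in (10**3, 10**4, 10**5):
        rate = minimax_rate(d, T, n)
        # rate / T is the per-round rate; it shrinks as the horizon grows.
        print(f"T={T:>6}  rate={rate:10.1f}  rate/T={rate / T:.4f}")
```

With $d$ and $n$ held fixed, the ratio `rate / T` decreases as $T$ grows, which reflects that a regret of this order is sublinear in the time horizon.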

Abstract

We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT \log T \log n})$ for eve