We derive an alternative proof for the regret of Thompson sampling (\ts) in the stochastic linear bandit setting. While we obtain a regret bound of order $\widetilde{O}(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of the \ts. We leverage on the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function and how selecting optimal arms associated to \textit{optimistic} parameters does control it. Thus we show that \ts can be seen as a generic randomized algorithm where the sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof can be readily applied to regularized linear optimization and generalized linear model problems.

在随机线性赌博机问题中，我们为Thompson采样的后悔证明提供了一种替代证明方法。我们展示了后悔与目标函数的敏感性有关，并且选取与乐观参数相关的最优臂可以控制后悔，在具有固定概率为乐观的采样分布下来看，Thompson采样可以作为一种通用的随机化算法。我们还证明了这个理论可以轻松应用到正则化线性优化和广义线性模型问题中。

线性汤普森抽样再探