This paper studies the Bayesian regret of a variant of the Thompson Sampling
algorithm for bandit problems. It builds upon the information-theoretic
framework of [Russo and Van Roy, 2015] and, more specifically, on the
rate-distortion analysis of [Dong and Van Roy, 2020], which established a
regret bound of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear
bandit setting. We focus on bandit problems with a metric action space and,
using a chaining argument, establish new regret bounds for a variant of
Thompson Sampling that depend on the metric entropy of the action space.
Under a suitable continuity assumption on the rewards, our bound offers a tight
rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.

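This paper studies the Bayesian regret of a variant of the Thompson Sampling algorithm for bandit problems. It builds on an information-theoretic framework in which a rate-distortion analysis gives an upper bound on the regret rate for linear bandit problems. Using a chaining argument, we establish new bounds for bandit problems with a metric action space. Under a suitable continuity assumption on the rewards, our bound offers a tight rate for $d$-dimensional linear bandit problems.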