Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated that it has better empirical performance than state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a Thompson Sampling algorithm for the contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is perhaps the most important and widely studied version of the contextual bandits problem. We prove a high-probability regret bound of $\tilde{O}(\frac{1}{\sqrt{\epsilon}}\sqrt {T^{1+\epsilon}} d)$ in time $T$ for any $0<\epsilon <1$, where $d$ is the dimension of each context vector and $\epsilon$ is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of $\Omega(\sqrt{Td})$ for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for the contextual bandits problem. Our version of Thompson Sampling uses a Gaussian prior and a Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to more general distributions that satisfy certain general conditions.
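
To make the Gaussian prior and Gaussian likelihood setup concrete, the following is a minimal sketch of linear Thompson Sampling in Python with NumPy. It is an illustration under stated assumptions, not the paper's exact algorithm: the class name `LinearThompsonSampling`, the helper `simulate`, the identity prior precision, and the fixed scale `v` are all hypothetical choices made here, whereas the analysis in the paper ties the posterior scale to quantities such as $\epsilon$, $d$, and $T$. The simulation also draws contexts at random for simplicity instead of using an adaptive adversary.

```python
import numpy as np


class LinearThompsonSampling:
    """Illustrative Thompson Sampling for linear payoffs (hypothetical names)."""

    def __init__(self, d, v=1.0, seed=0):
        self.B = np.eye(d)      # Gaussian posterior precision, up to the scale v^2 (assumed prior: identity)
        self.f = np.zeros(d)    # running sum of reward-weighted contexts
        self.v = v              # posterior scale; treated here as a free parameter (assumption)
        self.rng = np.random.default_rng(seed)

    def posterior_mean(self):
        # Regularized least-squares estimate of the unknown parameter vector.
        return np.linalg.solve(self.B, self.f)

    def select_arm(self, contexts):
        # contexts has shape (n_arms, d): one context vector per arm.
        B_inv = np.linalg.inv(self.B)
        cov = self.v ** 2 * (B_inv + B_inv.T) / 2.0  # symmetrize against round-off
        mu_tilde = self.rng.multivariate_normal(B_inv @ self.f, cov)
        # Play the arm that looks best under the sampled parameter.
        return int(np.argmax(contexts @ mu_tilde))

    def update(self, context, reward):
        # Rank-one update of the Gaussian posterior after observing a reward.
        self.B += np.outer(context, context)
        self.f += reward * context


def simulate(T=500, n_arms=5, d=3, noise=0.1, seed=0):
    """Run the sketch on synthetic data with random (not adversarial) contexts."""
    rng = np.random.default_rng(seed)
    mu_true = rng.normal(size=d)
    mu_true /= np.linalg.norm(mu_true)
    agent = LinearThompsonSampling(d, v=0.5, seed=seed + 1)
    for _ in range(T):
        contexts = rng.normal(size=(n_arms, d))
        arm = agent.select_arm(contexts)
        reward = contexts[arm] @ mu_true + noise * rng.normal()
        agent.update(contexts[arm], reward)
    return mu_true, agent.posterior_mean()
```

Calling `simulate()` returns the synthetic true parameter and the posterior mean after `T` rounds, which gives a quick check that the estimate concentrates as rewards accumulate. Larger values of `v` widen the sampled posterior and increase exploration; smaller values make the choice of arm closer to greedy.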

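This paper designs and analyzes a generalized version of the Thompson Sampling algorithm, based on Bayesian ideas, for the stochastic contextual multi-armed bandit problem with linear payoff functions. It also provides the first theoretical guarantees for this algorithm and obtains a near-optimal regret guarantee.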

Contextual Bayesian Thompson Sampling Algorithm with Linear Payoffs