We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.

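This paper proposes PG-TS, an improved Thompson Sampling algorithm based on Polya-Gamma augmentation. Using a fast inference procedure, it addresses regret minimization in logistic contextual bandits. By explicitly estimating the posterior distribution of the context feature covariance, PG-TS converges faster than traditional algorithms in comparable settings.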
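
To make the idea of a Polya-Gamma augmented Gibbs sampler inside Thompson Sampling more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a single scalar context feature per arm (the general case uses vectors and matrices), a Gaussian prior on the weight `theta`, and an approximate Polya-Gamma draw from a truncated sum-of-gammas series rather than an exact sampler. The names `sample_pg`, `gibbs_theta`, `choose_arm`, `prior_mean`, `prior_var`, `n_terms`, and `n_sweeps` are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_pg(b, c, rng, n_terms=200):
    """Approximate draw from PG(b, c) using a truncated sum-of-gammas series.

    Truncating at n_terms makes this an approximation (assumption of this sketch).
    """
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # Gamma(shape=b, scale=1)
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def gibbs_theta(xs, ys, rng, prior_mean=0.0, prior_var=1.0, n_sweeps=50):
    """Run a short Gibbs chain for a scalar weight theta; return the last draw.

    xs: observed scalar contexts, ys: observed binary rewards (0 or 1).
    """
    theta = prior_mean
    # The reward-dependent term sum_i x_i * (y_i - 1/2) does not change between sweeps.
    kappa_sum = sum(x * (y - 0.5) for x, y in zip(xs, ys))
    for _ in range(n_sweeps):
        # Step 1: augmentation variables, omega_i | theta ~ PG(1, x_i * theta).
        omegas = [sample_pg(1.0, x * theta, rng) for x in xs]
        # Step 2: theta | omega, y is Gaussian (scalar form of the usual update).
        precision = sum(w * x * x for w, x in zip(omegas, xs)) + 1.0 / prior_var
        var = 1.0 / precision
        mean = var * (kappa_sum + prior_mean / prior_var)
        theta = rng.gauss(mean, math.sqrt(var))
    return theta


def choose_arm(contexts, theta):
    """Thompson step: pick the arm with the largest sampled logit x_a * theta."""
    return max(range(len(contexts)), key=lambda a: contexts[a] * theta)
```

In a bandit loop, one would call `gibbs_theta` on the history of observed contexts and rewards at each round, pass the sampled `theta` to `choose_arm` along with the current arm contexts, observe the binary reward, and append the new pair to the history.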

PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits