This paper considers the use of a simple posterior sampling algorithm for handling the exploration-exploitation trade-off when learning to optimize actions such as in multi-armed bandit problems. The algorithm, also known as Thompson Sampling, offers significant potential advantages over other popular approaches, and can be naturally applied to problems with infinite action spaces and complicated relationships among the rewards generated by different actions. We provide an analysis of posterior sampling in a general framework, making two theoretical contributions. The first establishes a connection between posterior sampling and upper-confidence bound (UCB) algorithms. For specific classes of models, this result lets us convert regret bounds developed for specific UCB algorithms into Bayes risk bounds for posterior sampling. Our second theoretical contribution is a Bayes risk bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion of dimension that measures the degree of dependence among action rewards. We also present simulation results showing that posterior sampling significantly outperforms several recently proposed UCB algorithms.
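As a minimal illustration of the posterior sampling idea described above, the sketch below applies it to a Bernoulli multi-armed bandit. The Bernoulli reward model, the uniform Beta(1, 1) prior, and the function name `thompson_sampling_bernoulli` are assumptions made for this example only; the paper's framework is far more general and this is not its experimental setup.

```python
import random


def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Run posterior (Thompson) sampling on a Bernoulli bandit.

    Assumes an independent Beta(1, 1) prior on each arm's mean reward
    (an illustrative choice, not taken from the paper).
    Returns the per-arm pull counts and the total reward collected.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    pulls = [0] * n_arms
    total_reward = 0

    for _ in range(horizon):
        # Draw one sample of each arm's mean from its Beta posterior.
        samples = [
            rng.betavariate(1 + successes[a], 1 + failures[a])
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Observe a Bernoulli reward from the (hidden) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0

        # Conjugate posterior update for the chosen arm.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
        total_reward += reward

    return pulls, total_reward
```

Calling `thompson_sampling_bernoulli([0.1, 0.9], horizon=2000)` returns the pull counts and total reward. Because each action is chosen according to a sample from the posterior, play should concentrate on the better arm as evidence accumulates, with no explicit confidence bound to tune.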

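This paper uses a simple posterior sampling algorithm, known as Thompson Sampling, to balance exploration and exploitation when learning to optimize actions. On the theoretical side, it establishes a connection between posterior sampling and UCB algorithms, and it provides a Bayesian regret bound for posterior sampling that applies broadly and can be specialized to many model classes.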

Learning to Optimize via Posterior Sampling