This paper studies the fixed-confidence best arm identification (BAI) problem
in the bandit framework under canonical single-parameter exponential models.
Many policies have been proposed for this problem, but most of them require
solving an optimization problem at every round and/or are forced to explore an
arm at least a certain number of times, except those restricted to the Gaussian
model. To address these limitations, we propose a novel policy that combines
Thompson sampling with a computationally efficient approach known as the best
challenger rule. While Thompson sampling was originally considered for
maximizing the cumulative reward, we demonstrate that it can be used to
naturally explore arms in BAI without forced exploration. We show that our
policy is asymptotically optimal for any two-armed bandit problem and achieves
near optimality for general $K$-armed bandit problems with $K\geq 3$. Nevertheless,
in numerical experiments, our policy shows competitive performance compared to
asymptotically optimal policies in terms of sample complexity while requiring
lower computational cost. In addition, we highlight the advantages of our policy
by comparing it to the concept of $\beta$-optimality, a relaxed notion of
asymptotic optimality commonly considered in the analysis of a class of
policies including the proposed one.
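
As a rough illustration of how a Thompson sample and a best challenger rule can be combined in one arm-selection step, the following Python sketch handles Bernoulli arms. It is one loose reading of the idea and not the policy analyzed in the paper: the uniform Beta prior, the helper names (`bernoulli_kl`, `transport_cost`, `select_arm`), and the rule for choosing between the leader and its challenger are all assumptions made for this example, and no stopping rule is included.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def transport_cost(mean_a, n_a, mean_b, n_b):
    """Likelihood-ratio style cost of confusing arm a with a worse arm b."""
    if mean_a <= mean_b:
        return 0.0
    pooled = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
    return n_a * bernoulli_kl(mean_a, pooled) + n_b * bernoulli_kl(mean_b, pooled)


def select_arm(successes, pulls, rng):
    """Pick the next arm to pull; every arm must already have at least one pull."""
    k = len(pulls)
    means = [s / n for s, n in zip(successes, pulls)]
    leader = max(range(k), key=lambda i: means[i])

    # Thompson step (assumed uniform prior): sample each arm from Beta(1 + s, 1 + f).
    draws = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, pulls)]
    sampled_best = max(range(k), key=lambda i: draws[i])
    if sampled_best != leader:
        # The posterior sample disagrees with the empirical leader: explore that arm.
        return sampled_best

    # Best challenger: the rival that is cheapest to confuse with the leader.
    challenger = min(
        (i for i in range(k) if i != leader),
        key=lambda i: transport_cost(means[leader], pulls[leader], means[i], pulls[i]),
    )
    # Assumed choice rule: pull whichever of the two has fewer samples so far.
    return leader if pulls[leader] <= pulls[challenger] else challenger


if __name__ == "__main__":
    rng = random.Random(0)
    print(select_arm([9, 1, 5], [10, 10, 10], rng))
```

Only the leader and one challenger are compared in each round, so the step costs a number of operations linear in the number of arms and avoids solving an optimization problem per round, which is the computational appeal the abstract refers to.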

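This paper studies the best arm identification problem under fixed confidence and proposes a policy that combines Thompson sampling with the best challenger rule, achieving near-optimal performance with low sample complexity.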

Thompson Exploration with Best Challenger Rule in Best Arm Identification

In this work, we initiate the idea of using denoising diffusion models to
learn priors for online decision-making problems. Our special focus is on the
meta-learning framework for bandits, with the goal of learning a strategy that
performs well across bandit tasks of the same class. To this end, we train a
diffusion model that learns the underlying task distribution and combine
Thompson sampling with the learned prior to deal with new tasks at test time.
Our posterior sampling algorithm is designed to carefully balance the
learned prior against the noisy observations that come from the learner's
interaction with the environment. To capture realistic bandit scenarios, we
also propose a novel diffusion model training procedure that can train even on
incomplete and/or noisy data, which could be of independent interest. Finally,
our extensive experimental evaluations clearly demonstrate the potential of the
proposed approach.

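This paper proposes using denoising diffusion models to learn priors for online decision-making problems and combines Thompson sampling with the learned prior to handle new tasks, yielding a meta-learning strategy that performs well across bandit tasks of the same class. A posterior sampling algorithm balances the prior against noisy observations from the environment. Extensive experiments demonstrate the potential of the proposed approach.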