Restless bandit problems assume time-varying reward distributions of the
arms, which adds flexibility to the model but makes the analysis more
challenging. We study learning algorithms under unknown reward distributions
and prove a sub-linear $O(\sqrt{T}\log T)$ regret bound for a variant of
Thompson sampling. Our analysis applies in the infinite time horizon setting,
resolving the open question raised by Jung and Tewari (2019) whose analysis is
limited to the episodic case. We adopt their policy mapping framework, which
allows our algorithm to be efficient and simultaneously keeps the regret
meaningful. Our algorithm adapts the TSDE algorithm of Ouyang et al. (2017) in
a non-trivial manner to account for the special structure of restless bandits.
We test our algorithm on a simulated dynamic channel access problem with
several policy mappings, and the empirical regrets agree with the theoretical
bound regardless of the choice of the policy mapping.

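This paper studies a Thompson sampling algorithm for restless bandit problems with unknown reward distributions. It proves a sub-linear $O(\sqrt{T}\log T)$ regret bound and tests the algorithm on a simulated dynamic channel access problem, where the empirical results agree with the theoretical bound.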
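
The dynamic channel access problem mentioned above is commonly modeled with two-state (good/bad) Markov channels. The following is a minimal sketch of that model under assumptions of our own, not the paper's experimental code: each arm `i` has hypothetical transition probabilities `p01[i]` (bad to good) and `p11[i]` (good to good), a pulled arm reveals its state, and the myopic policy serves as one example of a policy mapping. The function names are illustrative.

```python
import numpy as np

def propagate_belief(belief, p01, p11):
    """One-step update of P(channel is good) for channels that were not observed."""
    return belief * p11 + (1.0 - belief) * p01

def myopic_step(beliefs, p01, p11, true_states, n_active=1):
    """Pull the n_active arms with the highest belief of being good.

    Returns the pulled arms, the reward (number of good channels pulled),
    and the beliefs for the next time step.
    """
    # Stable sort so ties are broken by the lowest arm index.
    arms = np.argsort(-beliefs, kind="stable")[:n_active]
    reward = int(true_states[arms].sum())
    # Unobserved arms: belief drifts according to the Markov chain.
    new_beliefs = propagate_belief(beliefs, p01, p11)
    # Observed arms: the next belief is fixed by the state just seen.
    new_beliefs[arms] = np.where(true_states[arms] == 1, p11[arms], p01[arms])
    return arms, reward, new_beliefs
```

With known parameters this is a fixed policy; in the learning setting, the same mapping would be applied to parameters sampled from a posterior rather than to the true ones.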

Thompson Sampling in Non-Episodic Restless Bandits

Restless bandit problems are instances of non-stationary multi-armed bandits.
These problems have been well studied from the optimization perspective, where
the goal is to efficiently find a near-optimal policy when system parameters
are known. However, very few papers adopt a learning perspective, where the
parameters are unknown. In this paper, we analyze the performance of Thompson
sampling in episodic restless bandits with unknown parameters. We consider a
general policy map to define our competitor and prove an
$\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. Our competitor is
flexible enough to represent various benchmarks including the best fixed action
policy, the optimal policy, the Whittle index policy, or the myopic policy. We
also present empirical results that support our theoretical findings.

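This paper analyzes, from a learning perspective, episodic restless bandit problems with unknown parameters. Using Thompson sampling, it considers a general policy mapping to define the competitor and proves an $\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. The competitor is flexible enough to represent various benchmarks, including the best fixed action policy, the optimal policy, the Whittle index policy, and the myopic policy. Empirical results that support the theoretical findings are also provided.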
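
To make the episodic setting concrete, here is a rough sketch of Thompson sampling with a policy mapping. It is written under simplifying assumptions that are ours rather than the paper's: two-state channels, one arm pulled per step, a uniform prior over a finite hypothetical `grid` of candidate `(p01, p11)` pairs per arm, the myopic policy as the policy mapping, and every episode restarting with each channel good with probability 0.5.

```python
import numpy as np

def thompson_sampling_episodic(p01_true, p11_true, grid, n_episodes, horizon, seed=0):
    """Episodic Thompson sampling sketch; returns (total reward, log-posteriors)."""
    rng = np.random.default_rng(seed)
    n_arms, n_cand = len(p01_true), len(grid)
    cand_p01, cand_p11 = grid[:, 0], grid[:, 1]
    log_post = np.zeros((n_arms, n_cand))  # unnormalized log-posterior, uniform prior
    total_reward = 0
    for _ in range(n_episodes):
        # Sample one candidate parameter pair per arm from the posterior.
        idx = np.empty(n_arms, dtype=int)
        for a in range(n_arms):
            w = np.exp(log_post[a] - log_post[a].max())
            idx[a] = rng.choice(n_cand, p=w / w.sum())
        s_p01, s_p11 = cand_p01[idx], cand_p11[idx]
        # Episode restart (assumed initial distribution: good w.p. 0.5).
        states = (rng.random(n_arms) < 0.5).astype(int)
        belief = np.full(n_arms, 0.5)                # belief under sampled parameters
        cand_belief = np.full((n_arms, n_cand), 0.5)  # belief under each candidate
        for _ in range(horizon):
            arm = int(np.argmax(belief))  # myopic policy for the sampled parameters
            s = int(states[arm])
            total_reward += s
            # Likelihood of the observed state under every candidate for this arm.
            lik = cand_belief[arm] if s == 1 else 1.0 - cand_belief[arm]
            log_post[arm] += np.log(np.clip(lik, 1e-12, None))
            # Advance beliefs one step; the pulled arm's belief resets to its row.
            cand_belief = cand_belief * cand_p11 + (1.0 - cand_belief) * cand_p01
            cand_belief[arm] = cand_p11 if s == 1 else cand_p01
            belief = belief * s_p11 + (1.0 - belief) * s_p01
            belief[arm] = s_p11[arm] if s == 1 else s_p01[arm]
            # True channel states evolve regardless of which arm was pulled.
            p_good = np.where(states == 1, p11_true, p01_true)
            states = (rng.random(n_arms) < p_good).astype(int)
    return total_reward, log_post
```

Because parameters are resampled only at episode boundaries, updating the posterior after every observation is equivalent to updating it once at the end of each episode. Swapping `np.argmax(belief)` for another rule would correspond to choosing a different policy mapping, and hence a different competitor.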