Restless bandit problems are instances of non-stationary multi-armed bandits.
These problems have been well studied from the optimization perspective, where
the goal is to efficiently find a near-optimal policy when system parameters
are known. However, very few papers adopt a learning perspective, where the
parameters are unknown. In this paper, we analyze the performance of Thompson
sampling in episodic restless bandits with unknown parameters. We consider a
general policy map to define our competitor and prove an
$\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. Our competitor is
flexible enough to represent various benchmarks, including the best fixed-action
policy, the optimal policy, the Whittle index policy, or the myopic policy. We
also present empirical results that support our theoretical findings.

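This paper analyzes episodic restless bandit problems with unknown parameters from a learning perspective. Using Thompson sampling and taking a general policy map as the competitor, it proves an $\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. The competitor is flexible enough to represent various benchmarks, including the best fixed-action policy, the optimal policy, the Whittle index policy, or the myopic policy. Empirical results that support the theoretical findings are also provided.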

Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems
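
The episodic Thompson sampling scheme summarized in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: the arms here are stationary Bernoulli arms rather than restless Markov arms, the Beta(1, 1) priors are an assumption, and the names `episodic_thompson` and `best_fixed_arm` are hypothetical. What the sketch keeps is the structure: parameters are sampled from the posterior once per episode, a policy map turns the sample into a policy, and that policy is followed for the whole episode.

```python
import random


def best_fixed_arm(theta):
    """Hypothetical policy map: sampled parameters -> arm with the highest mean."""
    return max(range(len(theta)), key=lambda i: theta[i])


def episodic_thompson(true_means, episodes, horizon,
                      policy_map=best_fixed_arm, seed=0):
    """Simplified episodic Thompson sampling on stationary Bernoulli arms.

    Assumption: each arm pays 1 with a fixed unknown probability, which is
    a stand-in for the restless (Markov) arms studied in the paper.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1.0] * k  # Beta(1, 1) prior: pseudo-count of successes
    beta = [1.0] * k   # Beta(1, 1) prior: pseudo-count of failures
    total_reward = 0
    for _ in range(episodes):
        # Sample one parameter vector from the posterior per episode.
        theta = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        # The policy map converts the sample into the policy to follow.
        arm = policy_map(theta)
        for _ in range(horizon):
            reward = 1 if rng.random() < true_means[arm] else 0
            total_reward += reward
            # Conjugate Beta-Bernoulli posterior update.
            alpha[arm] += reward
            beta[arm] += 1 - reward
    return total_reward, alpha, beta
```

Swapping `best_fixed_arm` for another function of the sampled parameters changes the benchmark the learner competes against, which is the role the policy map plays in the abstract.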

In this paper, we consider the revealed preferences problem from a learning
perspective. Every day, a price vector and a budget are drawn from an unknown
distribution, and a rational agent buys his most preferred bundle according to
some unknown utility function, subject to the given prices and budget
constraint. We wish not only to find a utility function which rationalizes a
finite set of observations, but to produce a hypothesis valuation function
which accurately predicts the behavior of the agent in the future. We give
efficient algorithms with polynomial sample complexity for agents with linear
valuation functions, as well as for agents with linearly separable, concave
valuation functions with bounded second derivative.

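This paper considers the revealed preferences problem from a learning perspective. For agents with linear valuation functions, as well as for agents with linearly separable, concave valuation functions with bounded second derivative, we give efficient algorithms with polynomial sample complexity.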
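
To make the agent's behavior in the linear case concrete, here is a minimal sketch of how a rational agent with a linear valuation would pick a bundle given prices and a budget. It rests on assumptions not stated in the abstract: goods are divisible, at most one unit of each good is available (bundles lie in $[0,1]^n$), and all prices are strictly positive. The function name `demand_linear` is hypothetical. Under these assumptions the problem is a fractional knapsack, so buying goods in decreasing order of value per unit price is optimal.

```python
def demand_linear(values, prices, budget):
    """Utility-maximizing bundle for a linear valuation (illustrative sketch).

    Maximizes sum(values[i] * x[i]) subject to sum(prices[i] * x[i]) <= budget
    and 0 <= x[i] <= 1. Assumes every price is strictly positive.
    """
    n = len(values)
    bundle = [0.0] * n
    remaining = budget
    # Consider goods from highest to lowest value per unit of money.
    order = sorted(range(n), key=lambda i: values[i] / prices[i], reverse=True)
    for i in order:
        # Stop when the budget is spent or the remaining goods add no value.
        if remaining <= 0 or values[i] <= 0:
            break
        # Buy as much of good i as the budget allows, up to one unit.
        fraction = min(1.0, remaining / prices[i])
        bundle[i] = fraction
        remaining -= fraction * prices[i]
    return bundle


# Values (3, 1, 4), prices (1, 1, 2), budget 2: the agent buys all of
# good 0, then spends the remaining budget on half of good 2.
print(demand_linear([3, 1, 4], [1, 1, 2], 2.0))  # [1.0, 0.0, 0.5]
```

A learner in this setting observes only price, budget, and bundle triples produced by such a rule and must predict future bundles without knowing `values`.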