We consider the infinite-horizon, average-reward restless bandit problem in
discrete time. We propose a new class of policies that are designed to drive a
progressively larger subset of arms toward the optimal distribution. We show
that our policies are asymptotically optimal with an $O(1/\sqrt{N})$ optimality
gap for an $N$-armed problem, provided that the single-armed relaxed problem is
unichain and aperiodic. Our approach departs from most existing work, which
focuses on index or priority policies that rely on the Uniform Global
Attractor Property (UGAP) to guarantee convergence to the optimum, and from a
recently developed simulation-based policy, which requires a Synchronization
Assumption (SA).
