We consider the discrete-time infinite-horizon average-reward restless bandit
problem. We propose a novel policy that maintains two dynamic subsets of arms:
one subset of arms has a nearly optimal state distribution and takes actions
according to an Optimal Local Control routine; the other subset of arms is
driven towards the optimal state distribution and gradually merged into the
first subset. We show that our policy is asymptotically optimal with an
$O(\exp(-C N))$ optimality gap for an $N$-armed problem, under the mild
assumptions of aperiodic-unichain, non-degeneracy, and local stability. Our
policy is the first to achieve exponential asymptotic optimality under the
above set of easy-to-verify assumptions, whereas prior work either requires a
strong Global Attractor assumption or only achieves an $O(1/\sqrt{N})$
optimality gap. We further discuss the fundamental obstacles to significantly
weakening our assumptions. In particular, we prove a lower bound showing that
local stability is fundamental for exponential asymptotic optimality.
