Most existing federated multi-armed bandit (FMAB) designs presume that
clients will implement the specified design to collaborate with the server.
In reality, however, it may not be possible to modify the clients' existing
protocols. To address this challenge, this work
focuses on clients who always maximize their individual cumulative rewards, and
introduces a novel idea of "reward teaching", where the server guides the
clients towards global optimality through implicit local reward adjustments.
Under this framework, the server faces two tightly coupled tasks of bandit
learning and target teaching, whose combination is non-trivial and challenging.
A phased approach, called Teaching-After-Learning (TAL), is first designed to
encourage and discourage clients' explorations separately. General performance
analyses of TAL are established when the clients' strategies satisfy certain
mild requirements. With novel technical approaches developed to analyze the
warm-start behaviors of bandit algorithms, particularized guarantees of TAL
with clients running UCB or epsilon-greedy strategies are then obtained. These
results demonstrate that TAL achieves logarithmic regret while incurring only
logarithmic adjustment costs, which is order-optimal with respect to a natural
lower bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is
developed with the idea of successive arm elimination to break the non-adaptive
phase separation in TAL. Rigorous analyses demonstrate that, when the clients
run UCB1, TWL improves on TAL in its dependence on the sub-optimality gaps,
thanks to its adaptive design. Experimental results demonstrate the
effectiveness and generality of the proposed algorithms.

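This paper introduces a novel concept called reward teaching, in which the server guides clients toward global optimality through implicit local reward adjustments. For settings where the clients' existing protocols cannot be modified, the authors propose a phased approach called Teaching-After-Learning (TAL) and derive particularized guarantees for it by developing new analysis techniques. Building on this, they propose the Teaching-While-Learning (TWL) algorithm, which uses successive arm elimination to break the non-adaptive phase separation in TAL. Experimental results demonstrate the effectiveness and generality of the proposed algorithms.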
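
As a rough illustration of the reward-teaching idea, the sketch below simulates a single UCB1 client whose observed rewards are shifted by a server so that the client settles on a server-chosen target arm. It is not the paper's TAL or TWL algorithm. The function names, the Bernoulli rewards, the `margin` parameter, the shifting rule, and the assumption that the server already knows the client's local means (as if a learning phase had finished) are all hypothetical choices made for this example.

```python
import math
import random

def ucb1_choose(counts, sums, t):
    """Standard UCB1 index rule; arms that were never pulled are tried first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

def teach_ucb_client(local_means, target_arm, horizon, margin=0.1, seed=0):
    """Simulate a UCB1 client whose observed rewards are adjusted by a server.

    Hypothetical teaching rule (an assumption, not the paper's method):
    rewards of non-target arms are shifted down so that their perceived
    mean is at most the target arm's local mean minus `margin`.
    Returns (number of target-arm pulls, cumulative adjustment cost).
    """
    rng = random.Random(seed)
    k = len(local_means)
    counts, sums = [0] * k, [0.0] * k
    target_pulls, adjustment_cost = 0, 0.0
    for t in range(1, horizon + 1):
        arm = ucb1_choose(counts, sums, t)
        # Bernoulli reward drawn from the client's local model.
        reward = 1.0 if rng.random() < local_means[arm] else 0.0
        shift = 0.0
        if arm != target_arm:
            # Only shift arms that would otherwise look too attractive.
            shift = max(0.0, local_means[arm] - (local_means[target_arm] - margin))
        counts[arm] += 1
        sums[arm] += reward - shift  # the client only sees the adjusted reward
        adjustment_cost += shift
        if arm == target_arm:
            target_pulls += 1
    return target_pulls, adjustment_cost
```

For example, with `local_means=[0.9, 0.5, 0.3]` and `target_arm=1`, an unmodified UCB1 client would favour arm 0, whereas under this shifting rule most pulls should go to arm 1. The adjustment cost accrues only on non-target pulls, which UCB1 makes increasingly rarely.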

Reward Teaching for Federated Multi-armed Bandits

Federated multi-armed bandits (FMAB) is a new bandit paradigm that parallels
the federated learning (FL) framework in supervised learning. It is inspired by
practical applications in cognitive radio and recommender systems, and enjoys
features that are analogous to FL. This paper proposes a general framework of
FMAB and then studies two specific federated bandit models. We first study the
approximate model where the heterogeneous local models are random realizations
of the global model from an unknown distribution. This model introduces a new
uncertainty of client sampling, as the global model may not be reliably learned
even if the finite local models are perfectly known. Furthermore, this
uncertainty cannot be quantified a priori without knowledge of the
suboptimality gap. We solve the approximate model by proposing Federated Double
UCB (Fed2-UCB), which constructs a novel "double UCB" principle accounting for
uncertainties from both arm and client sampling. We show that gradually
admitting new clients is critical in achieving an O(log(T)) regret while
explicitly considering the communication cost. The exact model, where the
global bandit model is the exact average of heterogeneous local models, is then
studied as a special case. We show that, somewhat surprisingly, the
order-optimal regret can be achieved independent of the number of clients with
a careful choice of the update periodicity. Experiments using both synthetic
and real-world datasets corroborate the theoretical analysis and demonstrate
the effectiveness and efficiency of the proposed algorithms.

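This paper proposes a general framework for federated multi-armed bandits and studies two specific models, the approximate model and the exact model. It proposes Federated Double UCB (Fed2-UCB) to solve the approximate model and treats the exact model as a special case. Theoretical and experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.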
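
The client-sampling uncertainty in the approximate model can be illustrated with a small simulation. In the sketch below, each client's local mean for an arm is the global mean plus independent Gaussian noise, and the function estimates how often the best arm of the averaged local models differs from the globally best arm. The Gaussian noise, the function name, and all parameter values are assumptions made for illustration; the abstract only states that local models are drawn from an unknown distribution. This is not Fed2-UCB, and it assumes the local means are known exactly so that client sampling is the only source of error.

```python
import random

def misidentification_rate(global_means, num_clients, noise_std, trials=2000, seed=0):
    """Estimate how often averaging `num_clients` local models picks the wrong arm.

    Assumption for this sketch: local mean = global mean + N(0, noise_std^2),
    drawn independently per client and per arm.
    """
    rng = random.Random(seed)
    best = max(range(len(global_means)), key=global_means.__getitem__)
    wrong = 0
    for _ in range(trials):
        # Average the perfectly known local means of the sampled clients.
        avg = [
            sum(rng.gauss(mu, noise_std) for _ in range(num_clients)) / num_clients
            for mu in global_means
        ]
        if max(range(len(avg)), key=avg.__getitem__) != best:
            wrong += 1
    return wrong / trials
```

Comparing, say, `num_clients=1` with `num_clients=100` for two arms with close global means should show the error rate falling as more clients are admitted. This is consistent with the abstract's point that the global model may not be reliably learned from a finite set of local models.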