The contextual linear bandit is an important online learning problem in which,
given arm features, a learning agent selects an arm at each round to maximize
the cumulative reward in the long run. A line of work called clustering
of bandits (CB) exploits the collaborative effect over user preferences and
has shown significant improvements over classic linear bandit algorithms.
However, existing CB algorithms require well-specified linear user models and
can fail when this critical assumption does not hold. Whether robust CB
algorithms can be designed for more practical scenarios with misspecified user
models remains an open problem. In this paper, we are the first to formulate
the problem of clustering of bandits with misspecified user models
(CBMUM), where the expected rewards in user models can be perturbed away from
perfect linear models. We devise two robust CB algorithms, RCLUMB and RSCLUMB
(representing the learned clustering structure with a dynamic graph and with
sets, respectively), that can accommodate the inaccurate user preference
estimates and erroneous clustering caused by model misspecification. We prove regret
upper bounds of $O(\epsilon_*T\sqrt{md\log T} + d\sqrt{mT}\log T)$ for our
algorithms under milder assumptions than previous CB works (notably, we move
past a restrictive technical assumption on the distribution of the arms), which
match the lower bound asymptotically in $T$ up to logarithmic factors, and also
match the state-of-the-art results in several degenerate cases. The techniques
used to bound the regret caused by misclustering users are quite general and
may be of independent interest. Experiments on both synthetic and real-world
data show that our algorithms outperform previous ones.
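
As a rough illustration of how the two terms of the regret bound above trade
off, the following Python sketch evaluates them with constant factors dropped.
It assumes (the abstract does not say so) that $m$ is the number of user
clusters and $d$ the feature dimension; the function names are hypothetical.
Setting the two terms equal gives a crossover at
$\epsilon_* = \sqrt{d\log T / T}$, above which the misspecification term
dominates.

```python
import math


def regret_bound_terms(eps_star, T, m, d):
    """Return (misspecification term, well-specified term) of the bound.

    Constants are ignored, so only the relative growth is meaningful.
    Assumption: m = number of clusters, d = feature dimension.
    """
    log_T = math.log(T)
    misspec_term = eps_star * T * math.sqrt(m * d * log_T)
    linear_term = d * math.sqrt(m * T) * log_T
    return misspec_term, linear_term


def crossover_eps(T, d):
    """Misspecification level at which the two terms are equal."""
    return math.sqrt(d * math.log(T) / T)


if __name__ == "__main__":
    T, m, d = 100_000, 10, 20  # illustrative values, not from the paper
    for eps in (0.0, crossover_eps(T, d), 0.2):
        a, b = regret_bound_terms(eps, T, m, d)
        print(f"eps_*={eps:.4f}  misspec={a:.1f}  well-specified={b:.1f}")
```

With $\epsilon_* = 0$ the first term vanishes and only the
$d\sqrt{mT}\log T$ term remains, which is the well-specified regime the
abstract refers to as a degenerate case.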

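We formulate the problem of clustering of bandits under misspecified user models, design two robust algorithms that accommodate inaccurate user preference estimates and the erroneous clustering caused by model misspecification, and prove regret upper bounds for them. Experiments show that our algorithms outperform previous ones.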

Online Clustering of Bandits with Misspecified User Models

The rapid proliferation of decentralized learning systems creates a need
for differentially private cooperative learning. In this paper, we study this
problem in the context of the contextual linear bandit: we consider a collection of agents
cooperating to solve a common contextual bandit, while ensuring that their
communication remains private. For this problem, we devise \textsc{FedUCB}, a
multiagent private algorithm for both centralized and decentralized
(peer-to-peer) federated learning. We provide a rigorous technical analysis of
its utility in terms of regret, improving several results in cooperative bandit
learning, and provide rigorous privacy guarantees as well. Our algorithms
provide competitive performance both in terms of pseudoregret bounds and
empirical benchmark performance in various multi-agent settings.
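
To make the idea of private communication concrete, here is a minimal Python
sketch of one generic approach: each agent perturbs its local sufficient
statistics with Gaussian noise before sharing them. This is not claimed to be
the mechanism used by FedUCB; the function names and the noise scale `sigma`
are hypothetical, and `sigma` is not calibrated here to any differential
privacy guarantee.

```python
import random


def local_statistics(contexts, rewards):
    """Gram matrix sum(x x^T) and vector sum(r * x) for one agent."""
    d = len(contexts[0])
    gram = [[0.0] * d for _ in range(d)]
    vec = [0.0] * d
    for x, r in zip(contexts, rewards):
        for i in range(d):
            vec[i] += r * x[i]
            for j in range(d):
                gram[i][j] += x[i] * x[j]
    return gram, vec


def privatize(gram, vec, sigma, rng):
    """Add symmetric Gaussian noise to gram and Gaussian noise to vec.

    Assumption: sigma is an illustrative noise scale only. A real
    algorithm would derive it from the privacy budget and would also
    need to handle the noisy matrix possibly not being positive
    semidefinite.
    """
    d = len(vec)
    noisy_gram = [row[:] for row in gram]
    for i in range(d):
        for j in range(i, d):
            z = rng.gauss(0.0, sigma)
            noisy_gram[i][j] += z
            if i != j:
                noisy_gram[j][i] += z  # same draw keeps the matrix symmetric
    noisy_vec = [v + rng.gauss(0.0, sigma) for v in vec]
    return noisy_gram, noisy_vec


if __name__ == "__main__":
    rng = random.Random(0)
    gram, vec = local_statistics([[1.0, 0.0], [0.0, 2.0]], [1.0, 0.5])
    print(privatize(gram, vec, sigma=0.1, rng=rng))
```

Only the noisy statistics would leave the agent in this sketch; the raw
contexts and rewards stay local.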

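This paper addresses federated learning for contextual linear bandits and proposes FedUCB, a multi-agent private algorithm that applies to both centralized and decentralized (peer-to-peer) federated learning. While keeping communication private, it achieves strong utility in terms of pseudoregret and comes with privacy guarantees.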

Differentially-Private Federated Linear Bandits

We study a constrained contextual linear bandit setting, where the goal of
the agent is to produce a sequence of policies whose expected cumulative
reward over the course of $T$ rounds is maximized, while each policy has an
expected cost below a certain threshold $\tau$. We propose an upper-confidence bound
algorithm for this problem, called optimistic pessimistic linear bandit (OPLB),
and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its
$T$-round regret, where the denominator is the difference between the
constraint threshold and the cost of a known feasible action. We further
specialize our results to multi-armed bandits and propose a computationally
efficient algorithm for this setting. We prove a regret bound of
$\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in
$K$-armed bandits, which is a $\sqrt{K}$ improvement over the regret bound we
obtain by simply casting multi-armed bandits as an instance of contextual
linear bandits and using the regret bound of OPLB. We also prove a lower bound
for the problem and provide simulations to validate our
theoretical results.
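
The $\sqrt{K}$ improvement mentioned above can be checked with a few lines of
arithmetic. The Python sketch below assumes the standard reduction in which a
$K$-armed bandit is cast as a linear bandit with one-hot arm features, so that
$d = K$; it drops constants and logarithmic factors, and the function names
are hypothetical.

```python
import math


def oplb_bound(d, T, tau, c0):
    """d * sqrt(T) / (tau - c0), ignoring constants and log factors."""
    if tau <= c0:
        raise ValueError("the known feasible action must have cost c0 < tau")
    return d * math.sqrt(T) / (tau - c0)


def mab_bound(K, T, tau, c0):
    """sqrt(K * T) / (tau - c0), ignoring constants and log factors."""
    if tau <= c0:
        raise ValueError("the known feasible action must have cost c0 < tau")
    return math.sqrt(K * T) / (tau - c0)


def improvement_factor(K, T, tau, c0):
    """Ratio of the generic bound (with d = K) to the specialized bound."""
    return oplb_bound(K, T, tau, c0) / mab_bound(K, T, tau, c0)


if __name__ == "__main__":
    K, T, tau, c0 = 16, 10_000, 0.5, 0.1  # illustrative values only
    print(improvement_factor(K, T, tau, c0), math.sqrt(K))
```

The ratio is $K\sqrt{T} / \sqrt{KT} = \sqrt{K}$ regardless of $T$, $\tau$, and
$c_0$, since the $1/(\tau - c_0)$ factor appears in both bounds.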

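This paper studies a constrained contextual linear bandit problem, proposes the OPLB algorithm and proves an upper bound on its $T$-round regret, gives an efficient algorithm for the multi-armed bandit case, and provides a lower bound for the problem along with simulation results.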