We consider the problem of \emph{blocked} collaborative bandits where there
are multiple users, each with an associated multi-armed bandit problem. These
users are grouped into \emph{latent} clusters such that the mean reward vectors
of users within the same cluster are identical. Our goal is to design
algorithms that maximize the cumulative reward accrued by all the users over
time, under the \emph{constraint} that no arm of a user is pulled more than
$\mathsf{B}$ times. This problem was originally considered by
\cite{Bresler:2014}, and designing regret-optimal algorithms for it has since
remained an open problem. In this work, we propose an algorithm called
\texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that
collaborates across users, while simultaneously satisfying the budget
constraints, to maximize their cumulative rewards. Theoretically, under certain
reasonable assumptions on the latent structure, with $\mathsf{M}$ users,
$\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent
clusters, \texttt{B-LATTICE} achieves a per-user regret of
$\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget
constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. These are the first
sub-linear regret bounds for this problem, and match the minimax regret bounds
when $\mathsf{B}=\mathsf{T}$. Empirically, we demonstrate that our algorithm
has superior performance over baselines even when $\mathsf{B}=1$.
\texttt{B-LATTICE} runs in phases: in each phase, it clusters users into
groups and collaborates across users within a group to quickly learn their
reward models.
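
To make the blocked constraint concrete, the following is a minimal sketch of the setting only, not of \texttt{B-LATTICE}. It assumes Gaussian reward noise, uniformly drawn cluster means, and a uniformly random policy over the arms that still have budget; the function name and all default values are hypothetical and chosen for illustration.

```python
import numpy as np

def simulate_blocked_bandit(M=6, N=8, C=2, T=8, B=1, seed=0):
    """Toy simulation of the blocked setting: each (user, arm) pair may be
    pulled at most B times. The policy is uniform random over arms with
    remaining budget (an assumption for illustration, NOT B-LATTICE)."""
    assert T <= N * B, "each user needs enough total budget to play T rounds"
    rng = np.random.default_rng(seed)
    # One mean reward vector per latent cluster (assumed uniform on [0, 1]).
    cluster_means = rng.uniform(0.0, 1.0, size=(C, N))
    # Hidden cluster label of each user; users in a cluster share their means.
    cluster_of = rng.integers(0, C, size=M)
    pulls = np.zeros((M, N), dtype=int)  # pull counts per (user, arm)
    total_reward = 0.0
    for _ in range(T):
        for u in range(M):
            # Arms of user u that have not yet exhausted the budget B.
            allowed = np.flatnonzero(pulls[u] < B)
            arm = rng.choice(allowed)
            pulls[u, arm] += 1
            # Noisy reward around the cluster mean (Gaussian noise is assumed).
            total_reward += cluster_means[cluster_of[u], arm] + rng.normal(0.0, 0.1)
    return pulls, total_reward

pulls, total_reward = simulate_blocked_bandit()
```

With the defaults above, $\mathsf{B}=1$ and $\mathsf{T}=\mathsf{N}$, so every user pulls each arm exactly once. This shows why a user cannot learn from their own repeated pulls and must rely on information shared across users in the same cluster.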

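We design an algorithm called \texttt{B-LATTICE} (Blocked Latent bAndiTs via maTrIx ComplEtion) that collaborates across users while satisfying the budget constraints, in order to maximize their cumulative rewards. Theoretically, under reasonable assumptions on the latent structure, with $\mathsf{M}$ users, $\mathsf{N}$ arms, $\mathsf{T}$ rounds per user, and $\mathsf{C}=O(1)$ latent clusters, \texttt{B-LATTICE} achieves a per-user regret of $\widetilde{O}(\sqrt{\mathsf{T}(1 + \mathsf{N}\mathsf{M}^{-1})})$ under a budget constraint of $\mathsf{B}=\Theta(\log \mathsf{T})$. This is the first sub-linear regret bound for this problem, and it matches the minimax regret bound when $\mathsf{B}=\mathsf{T}$. Empirically, we show that our algorithm performs better than baselines even when $\mathsf{B}=1$.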

Blocked Collaborative Bandits: Online Collaborative Filtering with Per-Item Budget Constraints

The contextual bandit problem is a theoretically justified framework with
wide applications in various fields. While previous studies of this problem
usually require independence between noise and contexts, our work considers a
more sensible setting where the noise becomes a latent confounder that affects
both contexts and rewards. Such a confounded setting is more realistic and
covers a broader range of applications. However, an unaddressed
confounder biases the reward function estimate and thus leads to
large regret. To deal with the challenges brought by the confounder, we apply
dual instrumental variable regression, which can correctly identify the
true reward function. We prove that the convergence rate of this method is
near-optimal in two types of widely used reproducing kernel Hilbert spaces.
These guarantees allow us to design computationally efficient and
regret-optimal algorithms for confounded bandit problems.
The numerical results illustrate the efficacy of our proposed algorithms in the
confounded bandit setting.
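
The bias caused by a confounder, and how an instrument removes it, can be illustrated with a small linear example. The sketch below does not implement the dual instrumental variable regression used in this work; it swaps in classical two-stage least squares on a scalar linear model. The data-generating process, the function name, and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def ols_vs_2sls(n=20000, theta=2.0, seed=0):
    """Compare ordinary least squares with two-stage least squares (2SLS)
    on a toy confounded model. This is classical linear 2SLS, NOT the dual
    IV method; all modelling choices here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)   # instrument: drives the context, independent of the noise
    e = rng.normal(size=n)   # latent confounder that also acts as reward noise
    x = z + e + 0.1 * rng.normal(size=n)   # context depends on the confounder
    y = theta * x + e                      # reward is corrupted by the same confounder
    # Naive OLS slope: biased because x and e are correlated.
    theta_ols = (x @ y) / (x @ x)
    # Stage 1: project the context onto the instrument.
    x_hat = z * ((z @ x) / (z @ z))
    # Stage 2: regress the reward on the projected context.
    theta_2sls = (x_hat @ y) / (x_hat @ x_hat)
    return theta_ols, theta_2sls

theta_ols, theta_2sls = ols_vs_2sls()
```

In this toy model the OLS estimate overshoots the true slope because the context and the noise share the confounder, while the 2SLS estimate stays close to `theta`. A bandit algorithm that relied on the biased estimate would keep misjudging rewards, which is the source of the large regret described above.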

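In this paper, we address the contextual bandit problem in which the noise acts as a latent confounder. We apply dual instrumental variable regression to remove the resulting bias in reward function estimation, and design computationally efficient, regret-optimal algorithms backed by theoretical guarantees.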