We consider a multi-armed bandit problem with distributed players, where each player independently samples one of N stochastic processes (arms) with unknown parameters and accrues reward in each slot without exchanging information with the other players. Players choosing the same arm collide, and, depending on the collision model, either none or only one of them receives reward. We formulate this as a decentralized multi-armed bandit problem. We measure the performance of a decentralized policy by the system regret, defined as the total reward loss with respect to the optimal performance in the ideal scenario where all arm parameters are known to all players and collisions among players are eliminated through perfect scheduling. We show that the minimum system regret grows with time at the same logarithmic order as in the centralized counterpart, where players exchange observations and make decisions jointly. A decentralized policy is constructed to achieve this optimal order. Furthermore, we show that the proposed policy belongs to a general class of decentralized policies, for which a uniform performance benchmark is established.
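
To make the notion of system regret concrete, the following is a minimal sketch rather than the construction used in the paper. It assumes the arm means are given (in the actual problem they are unknown to the players), that there are at most as many players as arms, and that regret is measured in expectation. The function name `expected_system_regret`, its parameters, and the numbers in the usage note are hypothetical and chosen for illustration only.

```python
from collections import Counter
from typing import Sequence


def expected_system_regret(
    means: Sequence[float],
    choices: Sequence[Sequence[int]],
    one_winner: bool = False,
) -> float:
    """Expected system regret of a fixed sequence of arm choices.

    means[i]:      mean reward of arm i (assumed known here for illustration)
    choices[t][k]: index of the arm picked by player k in slot t
    one_winner:    collision model; False means colliding players all get
                   nothing, True means exactly one of them collects the reward
    """
    num_players = len(choices[0])
    # Benchmark: the M players occupy the M best arms and never collide.
    best_per_slot = sum(sorted(means, reverse=True)[:num_players])

    regret = 0.0
    for slot in choices:
        counts = Counter(slot)
        # An arm contributes its mean if one player is on it, or if the
        # collision model lets one of the colliding players keep the reward.
        earned = sum(
            means[arm] for arm, n in counts.items() if n == 1 or one_winner
        )
        regret += best_per_slot - earned
    return regret
```

With hypothetical means `[0.9, 0.5, 0.2]` and two players choosing `[[0, 0], [0, 1], [1, 2]]` over three slots, the first slot loses reward to a collision, the second matches the benchmark, and the third loses reward because one player sits on the worst arm.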

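This paper studies a decentralized multi-armed bandit problem, proposes a decentralized policy that achieves the optimal order and ensures fairness, and proves a lower bound on the growth rate of the system regret. The problem has potential applications in cognitive radio networks, multichannel communication systems, multi-agent systems, web search and advertising, and social networks.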

Distributed Learning in Multi-Armed Bandit with Multiple Players