We consider the stochastic multiplayer multi-armed bandit problem, in which several players pull arms simultaneously and a collision occurs whenever the same arm is pulled by more than one player; this is a standard model of cognitive radio networks. We construct a decentralized algorithm that achieves the same performance as a centralized one when players are synchronized and observe their collisions. To do so, we build a communication protocol between players by deliberately forcing collisions, which allows them to share their exploration. Under weaker feedback, when collisions are not observed, we still maintain some communication between players, at the cost of an extra multiplicative term in the regret. We also prove that logarithmic growth of the regret is still achievable in the dynamic setting, where players are not synchronized with each other and communication is therefore impossible. Finally, we prove that if all players naively follow the celebrated UCB algorithm, the total regret grows linearly.
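
The collision model and the idea of communicating through forced collisions can be illustrated with a minimal sketch. It is not the algorithm of the paper: the function names `play_round` and `send_bit`, the Bernoulli rewards, the arm means, and the convention that a colliding player receives reward 0 are all assumptions made for illustration. The receiver stays on a fixed arm, and the sender transmits one bit per round by choosing whether or not to pull that same arm.

```python
import random
from collections import Counter


def play_round(choices, means, rng):
    """Simulate one synchronized round (hypothetical helper).

    choices[j] is the arm pulled by player j. Returns one
    (reward, collided) pair per player. A colliding player gets
    reward 0, which is an assumed convention for this sketch.
    """
    counts = Counter(choices)
    results = []
    for arm in choices:
        collided = counts[arm] > 1
        # Bernoulli reward with mean means[arm], zeroed on collision.
        reward = 0 if collided else int(rng.random() < means[arm])
        results.append((reward, collided))
    return results


def send_bit(bit, sender_arm, receiver_arm):
    """Return the arm the sender pulls to encode one bit (hypothetical helper).

    Bit 1: pull the receiver's arm and force a collision.
    Bit 0: stay on the sender's own arm, so no collision occurs.
    """
    return receiver_arm if bit else sender_arm


rng = random.Random(0)
means = [0.9, 0.5, 0.2]  # assumed arm means, for illustration only

# Player 0 sends the bits 1, 0, 1 to player 1, who stays on arm 1.
message = [1, 0, 1]
received = []
for bit in message:
    choices = [send_bit(bit, sender_arm=0, receiver_arm=1), 1]
    _, collided = play_round(choices, means, rng)[1]
    # The receiver decodes the bit from its own collision indicator.
    received.append(int(collided))
```

In this sketch the receiver recovers the message exactly because it observes its collisions, which corresponds to the stronger feedback setting described above. Each transmitted bit costs the colliding players one round of reward.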

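By constructing a communication protocol in which players deliberately collide in order to share information at very low cost, we propose a decentralized algorithm that achieves the same performance as a centralized one for the stochastic multiplayer multi-armed bandit problem arising in cognitive radio networks. When the communication protocol cannot be implemented, we introduce a more appropriate dynamic setting and, using a new algorithm, prove that logarithmic growth of the regret is still achievable in this model.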

SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits