Motivated by distributed selection problems, we formulate a new variant of the multi-player multi-armed bandit (MAB) model, which captures the stochastic arrival of requests to each arm as well as the policy for allocating requests to players. The challenge is to design a distributed learning algorithm under which players select arms according to the optimal arm pulling profile (an arm pulling profile prescribes the number of players at each arm) without communicating with each other. We first design a greedy algorithm that locates one of the optimal arm pulling profiles with polynomial computational complexity. We also design an iterative distributed algorithm that lets players commit to an optimal arm pulling profile in a constant number of rounds in expectation. We apply the explore-then-commit (ETC) framework to address the online setting in which the model parameters are unknown. We design an exploration strategy for players to estimate the optimal arm pulling profile. Since these estimates can differ across players, it is challenging for players to commit to a common profile. We therefore design an iterative distributed algorithm that guarantees players reach a consensus on the optimal arm pulling profile in only M rounds. We conduct experiments to validate our algorithm.
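
To make the notion of an arm pulling profile concrete, the following is a minimal sketch of a greedy allocation under an assumed reward model, not necessarily the exact formulation or algorithm of this work. The assumptions are: arm k receives a random number of requests D_k with a known probability mass function, each player serves at most one request, and each served request on arm k yields a hypothetical per-request reward `rewards[k]`. Under these assumptions the expected reward of placing n players on arm k is `rewards[k] * E[min(n, D_k)]`, which has diminishing marginal gains, so assigning players one at a time to the arm with the largest marginal gain is a natural greedy rule. The names `expected_served` and `greedy_profile` are illustrative only.

```python
from typing import List, Sequence


def expected_served(pmf: Sequence[float], n: int) -> float:
    """Return E[min(n, D)], where pmf[d] = P(D = d) (assumed request model)."""
    return sum(p * min(n, d) for d, p in enumerate(pmf))


def greedy_profile(
    pmfs: Sequence[Sequence[float]],
    rewards: Sequence[float],
    num_players: int,
) -> List[int]:
    """Assign players one by one to the arm with the largest marginal gain.

    Returns a list whose k-th entry is the number of players placed on arm k.
    """
    profile = [0] * len(pmfs)
    for _ in range(num_players):
        # Marginal expected reward of adding one more player to each arm.
        gains = [
            rewards[k]
            * (
                expected_served(pmfs[k], profile[k] + 1)
                - expected_served(pmfs[k], profile[k])
            )
            for k in range(len(pmfs))
        ]
        # Ties are broken in favor of the lowest arm index.
        best = max(range(len(pmfs)), key=lambda k: gains[k])
        profile[best] += 1
    return profile


# Hypothetical instance: arm 0 gets 1 or 2 requests with equal probability,
# arm 1 always gets exactly 1 request.
example_pmfs = [[0.0, 0.5, 0.5], [0.0, 1.0]]
example_rewards = [1.0, 0.8]
example_profile = greedy_profile(example_pmfs, example_rewards, num_players=3)
```

In this hypothetical instance the first player goes to arm 0 (gain 1.0), the second to arm 1 (gain 0.8 versus 0.5), and the third back to arm 0 (gain 0.5 versus 0), so two players end up on arm 0 and one on arm 1. Each step scans all arms once, so the number of marginal-gain evaluations is the number of players times the number of arms.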

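This work addresses distributed selection problems by proposing a new multi-player multi-armed bandit model that captures the stochastic arrival of requests to arms and the policy for allocating them. The key innovation is the design of a greedy algorithm and an iterative distributed algorithm, which allow players to select arms according to the optimal arm pulling profile without communicating. Experimental results show that the algorithm effectively drives players to reach a consensus within a limited number of rounds and has significant application potential.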

Multi-Player Multi-Armed Bandits with Stochastic Sharable Arm Capacities