We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in the multi-dueling setting. Here, the learner selects $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, which is determined by an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), that learns from such preference feedback, assumed to be generated by a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX relative to a Borda winner among $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ on the expected regret in this setting, which demonstrates that the proposed algorithm is near-optimal up to a $(\log K)^{1/3}$ factor.
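
To make the feedback model and the comparator concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual definition of the Borda score (the average probability of beating another arm) and one common form of the pairwise-subset choice model, in which the winner of a selected multiset is decided by a uniformly random pair from it. The function names `borda_scores` and `winner_probabilities` and the example matrix are hypothetical.

```python
import numpy as np

def borda_scores(P):
    """Borda score of each arm: average probability of beating another arm.

    P is a K x K preference matrix with P[i, j] + P[j, i] = 1 and P[i, i] = 0.5
    (assumed convention, not stated in the abstract).
    """
    K = P.shape[0]
    # Exclude the diagonal so each arm is compared only with the other K - 1 arms.
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

def winner_probabilities(P, subset):
    """Winner distribution over the m selected slots.

    Assumed pairwise-subset choice model: slot a wins with probability
    sum over b != a of 2 * P[subset[a], subset[b]] / (m * (m - 1)).
    """
    m = len(subset)
    probs = np.zeros(m)
    for a in range(m):
        for b in range(m):
            if a != b:
                probs[a] += 2.0 * P[subset[a], subset[b]] / (m * (m - 1))
    return probs

# Hypothetical 3-arm preference matrix for a single round.
P = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])

scores = borda_scores(P)                      # arm 0 has the largest Borda score
probs = winner_probabilities(P, [0, 1, 2])    # distribution of the observed winner
rng = np.random.default_rng(0)
winner_slot = rng.choice(len(probs), p=probs) # the only feedback the learner sees
```

In the adversarial setting the matrix `P` may change from round to round, so the Borda winner is defined with respect to the whole sequence of matrices rather than a single one as in this sketch.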

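This paper introduces the problem of regret minimization in adversarial multi-dueling bandits and proposes a new algorithm, MiDEX (Multi Dueling EXP3), which learns from preference feedback generated by a pairwise-subset choice model. It proves that the expected cumulative $T$-round regret of MiDEX relative to a Borda winner among $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$, and that the expected regret in this setting is lower bounded by $\Omega(K^{1/3} T^{2/3})$, showing that the proposed algorithm is near-optimal.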

Adversarial Multi-Dueling Bandits