BriefGPT.xyz
Jun, 2024
Adversarial Multi-dueling Bandits
Pratik Gajane
TL;DR
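This paper introduces the problem of regret minimization in adversarial multi-dueling bandits and proposes a new algorithm, MiDEX (Multi Dueling EXP3), which learns from preference feedback generated by a pairwise-subset choice model. The authors prove that the expected cumulative T-round regret of MiDEX, measured against a Borda winner from a set of K arms, is upper bounded by O((K log K)^{1/3} T^{2/3}). They also prove a lower bound of Ω(K^{1/3} T^{2/3}) on the expected regret in this setting, which shows that the proposed algorithm is near-optimal.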
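The regret above is measured against a Borda winner, which this summary does not define. As a minimal sketch, assuming the definition commonly used in the dueling-bandits literature (the Borda score of an arm is its average probability of beating each other arm), the winner can be computed from a pairwise preference matrix as follows. The function names and the 3-arm matrix are hypothetical illustrations, not taken from the paper.

```python
def borda_scores(P):
    """Return the Borda score of each arm.

    P[i][j] is the assumed probability that arm i is preferred over arm j.
    The score of arm i averages P[i][j] over all other arms j != i.
    """
    K = len(P)
    return [
        sum(P[i][j] for j in range(K) if j != i) / (K - 1)
        for i in range(K)
    ]


def borda_winner(P):
    """Return the index of an arm with the highest Borda score."""
    scores = borda_scores(P)
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical preference matrix for K = 3 arms (illustrative values only).
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]

print(borda_scores(P))
print(borda_winner(P))
```

In the adversarial setting the preference matrix may change from round to round, so a fixed matrix like this one only illustrates the scoring rule for a single round.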
Abstract
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In t