Risk awareness is an important consideration when formulating many real-world problems. In this paper we study a multi-armed bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level $\alpha$ of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound (UCB) algorithms, we introduce the first Thompson Sampling approaches for CVaR bandits. Building on recent work by Riou and Honda (2020), we propose $\alpha$-NPTS for bounded rewards and $\alpha$-Multinomial-TS for multinomial distributions. We provide a novel lower bound on the CVaR regret, which extends the concept of asymptotic optimality to CVaR bandits, and prove that $\alpha$-Multinomial-TS is the first algorithm to achieve this lower bound. Finally, we demonstrate empirically the benefit of Thompson Sampling approaches over their UCB counterparts.
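To make the CVaR criterion concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the lower-tail convention for rewards, in which the CVaR at level $\alpha$ of a discrete distribution is the average of its worst $\alpha$-fraction of outcomes. The arm-selection step is likewise an assumption: the abstract does not describe $\alpha$-NPTS in detail, so the sketch supposes that it follows the non-parametric scheme of Riou and Honda (2020), drawing Dirichlet weights over the observed rewards augmented with the known upper bound. The names `cvar_discrete`, `alpha_npts_choose`, `histories`, and `upper_bound` are hypothetical.

```python
import numpy as np


def cvar_discrete(values, probs, alpha):
    """CVaR at level alpha of a discrete reward distribution.

    Assumed convention: the mean of the worst alpha-fraction of outcomes
    (lower tail), so alpha = 1 recovers the ordinary mean.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(values)
    v, p = values[order], probs[order]
    # Probability mass strictly below each atom, in sorted order.
    cum_before = np.cumsum(p) - p
    # Part of each atom's mass that falls inside the lower alpha-tail.
    tail_mass = np.clip(alpha - cum_before, 0.0, p)
    return float(np.dot(v, tail_mass) / alpha)


def alpha_npts_choose(histories, alpha, upper_bound, rng):
    """One arm-selection step of a hypothetical alpha-NPTS-style rule.

    Assumption (not confirmed by the abstract): each arm's history is
    augmented with the reward upper bound, re-weighted by a uniform
    Dirichlet draw, and scored by the CVaR of the re-weighted distribution.
    """
    indices = []
    for rewards in histories:
        support = np.append(rewards, upper_bound)
        weights = rng.dirichlet(np.ones(len(support)))
        indices.append(cvar_discrete(support, weights, alpha))
    return int(np.argmax(indices))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two arms with rewards in [0, 1]; arm 1 has the better lower tail.
    histories = [[0.1, 0.9, 0.2, 0.8], [0.5, 0.6, 0.5, 0.4]]
    print(alpha_npts_choose(histories, alpha=0.5, upper_bound=1.0, rng=rng))
```

Because the index is a random draw, repeated calls on the same histories can select different arms; this randomization is what drives exploration in Thompson Sampling.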

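This paper studies a multi-armed bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. We introduce new Thompson Sampling approaches for CVaR bandits, which are particularly suited to problems based on physical resources. We provide a theoretical analysis of their CVaR regret, and our results show that these strategies are the first to achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bound for this setting.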

Optimal Thompson Sampling strategies for support-aware CVaR bandits