We study the stochastic multi-armed bandit problem when one knows the value
$\mu^{(\star)}$ of an optimal arm, as well as a positive lower bound on the
smallest positive gap $\Delta$. We propose a new randomized policy that attains
a regret {\em uniformly bounded over time} in this setting. We also prove
several lower bounds, which show in particular that bounded regret is not
possible if one only knows $\Delta$, and bounded regret of order $1/\Delta$ is
not possible if one only knows $\mu^{(\star)}$.
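
To make the quantities in this abstract concrete, the following is a minimal sketch of the standard definitions of the gaps and the pseudo-regret, not the policy proposed in the paper. The function names, the example means, and the pull counts are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def gaps(means: Sequence[float]) -> List[float]:
    """Gap of each arm: the best mean minus that arm's mean."""
    mu_star = max(means)  # value of an optimal arm
    return [mu_star - m for m in means]


def smallest_positive_gap(means: Sequence[float]) -> float:
    """Smallest strictly positive gap among the suboptimal arms."""
    positive = [g for g in gaps(means) if g > 0]
    if not positive:
        raise ValueError("all arms are optimal, so no positive gap exists")
    return min(positive)


def pseudo_regret(means: Sequence[float], pulls: Sequence[int]) -> float:
    """Pseudo-regret: sum over arms of (gap) * (number of pulls)."""
    if len(means) != len(pulls):
        raise ValueError("means and pulls must have the same length")
    return sum(g * n for g, n in zip(gaps(means), pulls))


# Hypothetical three-armed instance and pull counts after 100 rounds.
example_means = [0.5, 0.25, 0.0]
example_pulls = [90, 8, 2]
example_regret = pseudo_regret(example_means, example_pulls)
```

Under these definitions, a regret that is uniformly bounded over time means that the total number of pulls of suboptimal arms, weighted by their gaps, stays below a constant that does not grow with the horizon.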

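This work studies how to design a randomized policy for the stochastic multi-armed bandit problem when the value of an optimal arm and the smallest gap are known, with the aim of controlling the regret incurred, and it examines the corresponding lower bounds and questions of optimality.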

Bounded regret in stochastic multi-armed bandits

We use Markov risk measures to formulate a risk-averse version of the
undiscounted total cost problem for a transient controlled Markov process. We
derive risk-averse dynamic programming equations and we show that a randomized
policy may be strictly better than deterministic policies, when risk measures
are employed. We illustrate the results on an optimal stopping problem and an
organ transplant problem.

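Markov risk measures are used to formulate a risk-averse version of the total cost problem for a Markov process. Risk-averse dynamic programming equations are derived, and it is shown that a randomized policy may be better than deterministic policies when risk measures are employed. The results are illustrated on an optimal stopping problem and an organ transplant problem.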