Inspired by the Reward-Biased Maximum Likelihood Estimate method of adaptive
control, we propose RBMLE -- a novel family of learning algorithms for
stochastic multi-armed bandits (SMABs). For a broad range of SMABs, including
both the parametric Exponential Family and the non-parametric
sub-Gaussian/Exponential family, we show that RBMLE yields an index policy. To
choose the bias-growth rate $\alpha(t)$ in RBMLE, we reveal the nontrivial
interplay between $\alpha(t)$ and the regret bound that generally applies to
both the Exponential Family and the sub-Gaussian/Exponential family
bandits. To quantify the finite-time performance, we prove that RBMLE attains
order-optimality by adaptively estimating the unknown constants in the
expression of $\alpha(t)$ for Gaussian and sub-Gaussian bandits. Extensive
experiments demonstrate that the proposed RBMLE achieves empirical regret
performance competitive with state-of-the-art methods, while being more
computationally efficient and scalable than the best-performing
ones among them.
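
As a minimal illustration of what an index policy looks like, the sketch below assumes unit-variance Gaussian rewards. Under that assumption, biasing the log-likelihood by $\alpha(t)$ times the mean of the candidate arm leads, after a short calculation, to an index of the form empirical mean plus $\alpha(t)/(2N_i(t))$, where $N_i(t)$ is the number of pulls of arm $i$. The function names, the warm-up round, the constant `c`, and the placeholder schedule $\alpha(t) = c \log t$ are assumptions made for illustration; they are not the tuned or adaptively estimated schedule analysed in the paper.

```python
import math
import random


def rbmle_gaussian_index(mean, pulls, alpha_t):
    """Assumed index for unit-variance Gaussian rewards:
    empirical mean plus a bias term that shrinks as the arm is pulled more."""
    return mean + alpha_t / (2.0 * pulls)


def run_index_policy(true_means, horizon, c=2.0, seed=0):
    """Simulate an index policy on a Gaussian bandit; return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k   # N_i(t): number of times each arm has been pulled
    sums = [0.0] * k  # cumulative reward collected from each arm

    for t in range(1, horizon + 1):
        if t <= k:
            # Warm-up (an assumption): pull every arm once so the means exist.
            arm = t - 1
        else:
            alpha_t = c * math.log(t)  # placeholder bias-growth rate
            # Each index depends only on that arm's own statistics and on t.
            arm = max(
                range(k),
                key=lambda i: rbmle_gaussian_index(
                    sums[i] / pulls[i], pulls[i], alpha_t
                ),
            )
        reward = rng.gauss(true_means[arm], 1.0)  # unit-variance Gaussian reward
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


if __name__ == "__main__":
    print(run_index_policy([0.1, 0.5, 0.9], horizon=2000))
```

The point of the sketch is structural: each arm's index is computed from that arm's own sample mean and pull count, so the per-round cost is linear in the number of arms.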

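RBMLE is a learning algorithm for the stochastic multi-armed bandit problem. Built on the Reward-Biased Maximum Likelihood Estimate method, it yields an index policy, adaptively estimates the unknown parameters, and performs well in experiments.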