In a multi-armed bandit (MAB) problem, a gambler needs to choose at each round
of play one of K arms, each characterized by an unknown reward distribution.
Reward realizations are only observed when an arm is selected, and the
gambler's objective is to maximize his cumulative expected earnings over some
given horizon of play T. To do this, the gambler needs t