Abstract

We address the problem of online sequential decision making, i.e., balancing the trade-off between exploiting current knowledge to maximize immediate performance and exploring new information to gain long-term benefits, using the multi-armed bandit framework.
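To make the exploration-exploitation trade-off concrete, the sketch below implements a standard epsilon-greedy policy on a Bernoulli bandit. This is an illustrative baseline, not the method of this paper; the arm means and the epsilon value are hypothetical.

```python
import random

def epsilon_greedy(true_means, n_rounds=10_000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit with the given arm means."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(n_rounds):
        if rng.random() < epsilon:            # explore: pick a random arm
            arm = rng.randrange(n_arms)
        else:                                 # exploit: pick best estimate so far
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running mean for the pulled arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return total_reward, estimates

if __name__ == "__main__":
    reward, est = epsilon_greedy([0.2, 0.5, 0.8])  # hypothetical arm means
    print(f"total reward: {reward:.0f}, estimates: {[round(e, 2) for e in est]}")
```

With epsilon fixed at 0.1, the policy spends a constant fraction of rounds exploring; more refined strategies shrink exploration over time as estimates sharpen.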