We propose a simple model selection approach for algorithms in stochastic
bandit and reinforcement learning problems. As opposed to prior work that
(implicitly) assumes knowledge of the optimal regret, we only require that each
base algorithm comes with a candidate regret bound that may or may not hold
during all rounds. In each round, our approach plays a base algorithm to keep
the candidate regret bounds of all remaining base algorithms balanced, and
eliminates algorithms that violate their candidate bound. We prove that the
total regret of this approach is bounded by the best valid candidate regret
bound times a multiplicative factor. This factor is reasonably small in several
applications, including linear bandits and MDPs with nested function classes,
linear bandits with unknown misspecification, and LinUCB applied to linear
bandits with different confidence parameters. We further show that, under a
suitable gap-assumption, this factor only scales with the number of base
algorithms and not their complexity when the number of rounds is large enough.
Finally, unlike recent efforts in model selection for linear stochastic
bandits, our approach is versatile enough to also cover cases where the context
information is generated by an adversarial environment, rather than a
stochastic one.

该文章提出了一种简单的模型选择方法，用于解决随机赌博和强化学习问题，并通过平衡算法的候选遗憾边界，以及淘汰违反其候选边界的算法来消除算法，从而证明该方法的总遗憾由最佳候选遗憾边界的一个乘性因子限制。