We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner come with a candidate regret bound that may or may not hold. Our meta-algorithm plays the base learners according to a schedule that keeps their candidate regret bounds balanced, and it stops playing a base learner once that learner is detected to violate its candidate bound. We develop careful misspecification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, with the additional benefit of high-probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best-of-both-worlds (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.
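The balancing schedule and the violation test described above can be illustrated with a toy sketch. It is not the algorithm of this work: it assumes purely stochastic rewards in [0, 1], replaces the misspecification tests with a simple Hoeffding-style comparison, and ignores the adversarial and linear-bandit structure entirely. All names (`regret_balancing`, `reward_fns`, `bound_fns`) and the bounds of the form c·sqrt(n) are hypothetical choices made for illustration.

```python
import math
import random


def regret_balancing(reward_fns, bound_fns, horizon, delta=0.05):
    """Toy regret balancing with elimination (assumes rewards in [0, 1]).

    reward_fns[i]() plays base learner i for one round and returns its reward.
    bound_fns[i](n) is learner i's candidate regret bound after n plays.
    Returns the set of surviving learners and the per-learner play counts.
    """
    m = len(reward_fns)
    active = set(range(m))
    plays = [0] * m
    totals = [0.0] * m
    # Hoeffding width with a union bound over learners and rounds (an assumption).
    log_term = math.log(2 * m * horizon / delta)

    def width(i):
        return math.sqrt(log_term / (2 * plays[i]))

    for _ in range(horizon):
        # Balancing: play the active learner whose candidate bound is smallest.
        i = min(active, key=lambda j: (bound_fns[j](plays[j]), j))
        totals[i] += reward_fns[i]()
        plays[i] += 1

        tested = [j for j in active if plays[j] > 0]
        best_lower = max(totals[j] / plays[j] - width(j) for j in tested)
        for j in tested:
            # If learner j's bound held, this optimistic value would (with high
            # probability) stay above every other learner's lower estimate.
            optimistic = (totals[j] + bound_fns[j](plays[j])) / plays[j] + width(j)
            if optimistic < best_lower:
                active.discard(j)  # candidate bound detected as violated
    return active, plays


# Hypothetical example: learner 0 claims a tight bound but earns low reward,
# learner 1 claims a looser bound and earns high reward.
rng = random.Random(0)
reward_fns = [lambda: float(rng.random() < 0.2), lambda: float(rng.random() < 0.8)]
bound_fns = [lambda n: math.sqrt(n), lambda n: 3 * math.sqrt(n)]
active, plays = regret_balancing(reward_fns, bound_fns, horizon=5000)
print(active, plays)
```

In this toy run the learner with the smaller claimed bound is played more often at first, which is what lets the test accumulate evidence against it quickly when its claim is wrong.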

Optimal model selection