We consider the problem of designing contextual bandit algorithms in the ``cross-learning'' setting of Balseiro et al., where the learner observes the loss of the action they play under every possible context, not just the context of the current round. We specifically consider the setting where losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution. In this setting, we resolve an open problem of Balseiro et al. by providing an efficient algorithm with a regret bound of $\widetilde{O}(\sqrt{TK})$ that is tight up to logarithmic factors and independent of the number of contexts. As a consequence, we obtain the first nearly tight regret bounds for the problems of learning to bid in first-price auctions (under unknown value distributions) and sleeping bandits with a stochastic action set. At the core of our algorithm is a novel technique for coordinating the execution of a learning algorithm over multiple epochs so as to remove correlations between the estimation of the unknown distribution and the actions played by the algorithm. This technique may be of independent interest for other learning problems that involve estimating an unknown context distribution.
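To make the cross-learning feedback model concrete, the following minimal sketch uses the first-price auction application mentioned above. It is an illustration under stated assumptions, not the algorithm of the paper: the function name `cross_learning_losses`, the convention that the loss is the negative utility, and the rule that ties are won by the bidder are all hypothetical choices made for this example. The context is the bidder's value and the action is the bid; once the outcome of a bid is known, the loss of that same bid can be computed for every possible value.

```python
def cross_learning_losses(bid, highest_competing_bid, values):
    """Return the loss of the played bid under every possible value (context).

    Assumptions for this sketch: the loss is the negative utility, and the
    bidder wins whenever its bid is at least the highest competing bid.
    """
    wins = bid >= highest_competing_bid
    # Utility in a first-price auction is (value - bid) on a win, 0 otherwise.
    return {v: (-(v - bid) if wins else 0.0) for v in values}


# Hypothetical discretized set of values (contexts).
values = [0.25, 0.5, 0.75, 1.0]

# One round: the learner bids 0.5 and the highest competing bid is 0.4.
losses = cross_learning_losses(bid=0.5, highest_competing_bid=0.4, values=values)
for v in values:
    print(f"value={v:.2f}  loss of bid 0.5 = {losses[v]:+.2f}")
```

Under ordinary contextual bandit feedback, only the entry for the value actually realized in the round would be observed; cross-learning makes the whole dictionary available for the played bid.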

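In this paper, we address the problem of designing contextual bandit algorithms in the ``cross-learning'' setting proposed by Balseiro et al. by providing an efficient algorithm with a nearly tight (up to logarithmic factors) regret bound of $\widetilde{O}(\sqrt{TK})$, independent of the number of contexts. As a consequence, we obtain nearly tight regret bounds for learning to bid in first-price auctions under unknown value distributions and for sleeping bandits with a stochastic action set. At the core of our algorithm is a new technique for coordinating the execution of a learning algorithm over multiple epochs so as to remove correlations between the estimation of the unknown distribution and the actions played by the algorithm. This technique may be of independent interest for other learning problems that involve estimating an unknown context distribution.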

Optimal Cross-Learning for Contextual Bandits with Unknown Context Distributions