We study online meta-learning with bandit feedback, with the goal of
improving performance across multiple tasks if they are similar according to
some natural similarity measure. As the first to target the adversarial
online-within-online partial-information setting, we design meta-algorithms
that combine outer learners to simultaneously tune the initialization and other
hyperparameters of an inner learner for two important cases: multi-armed
bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners
initialize and set hyperparameters of the Tsallis-entropy generalization of
Exp3, with the task-averaged regret improving if the entropy of the
optima-in-hindsight is small. For BLO, we learn to initialize and tune online
mirror descent (OMD) with self-concordant barrier regularizers, showing that
task-averaged regret varies directly with an action-space-dependent measure
they induce. Our guarantees rely on proving that unregularized
follow-the-leader combined with two levels of low-dimensional hyperparameter
tuning is enough to learn a sequence of affine functions of non-Lipschitz and
sometimes non-convex Bregman divergences bounding the regret of OMD.

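This paper studies online meta-learning with bandit feedback, with the aim of improving performance across multiple similar tasks, where similarity is judged by some natural measure.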