We study online meta-learning with bandit feedback, with the goal of
improving performance across multiple tasks if they are similar according to
some natural similarity measure. As the first to target the adversarial
online-within-online partial-information setting, we design meta-algorithms
that combine outer learners to simultaneously tune the initialization and other
hyperparameters of an inner learner for two important cases: multi-armed
bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners
initialize and set hyperparameters of the Tsallis-entropy generalization of
Exp3, with the task-averaged regret improving if the entropy of the
optima-in-hindsight is small. For BLO, we learn to initialize and tune online
mirror descent (OMD) with self-concordant barrier regularizers, showing that
task-averaged regret varies directly with an action space-dependent measure
they induce. Our guarantees rely on proving that unregularized
follow-the-leader combined with two levels of low-dimensional hyperparameter
tuning is enough to learn a sequence of affine functions of non-Lipschitz and
sometimes non-convex Bregman divergences bounding the regret of OMD.
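
As a rough illustration of what "initializing and tuning an inner learner" means for MAB, the sketch below runs plain Exp3, the algorithm that the Tsallis-entropy variant generalizes, from a configurable starting distribution. It is not the paper's meta-algorithm: the function name `exp3`, the parameters `eta` and `init`, and the loss-matrix interface are all assumptions made for this example. A meta-learner in the sense of the abstract would choose `init` and `eta` across tasks.

```python
import numpy as np

def exp3(loss_matrix, eta, init=None, seed=0):
    """Run plain Exp3 on a T x K matrix of losses in [0, 1].

    `init` is a hypothetical knob: a strictly positive starting
    distribution over arms (uniform if omitted). Returns the arms pulled.
    """
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    if init is None:
        init = np.full(K, 1.0 / K)
    # Work in log space; init must have no zero entries.
    log_w = np.log(np.asarray(init, dtype=float))
    arms = np.empty(T, dtype=int)
    for t in range(T):
        # Normalize the weights into a sampling distribution.
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        arm = rng.choice(K, p=p)
        arms[t] = arm
        # Bandit feedback: only the pulled arm's loss is observed, so
        # use the importance-weighted estimate loss / p[arm].
        log_w[arm] -= eta * loss_matrix[t, arm] / p[arm]
    return arms

# Toy task: arm 0 always has loss 0, the other arms always have loss 1.
T, K = 2000, 3
losses = np.ones((T, K))
losses[:, 0] = 0.0
eta = np.sqrt(2 * np.log(K) / (K * T))
arms = exp3(losses, eta)
print("fraction of pulls on arm 0:", (arms == 0).mean())
```

Passing an `init` that already concentrates on the good arm shortens the exploration phase, which is the intuition behind the improvement when the optima-in-hindsight have low entropy.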

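This paper studies online meta-learning with bandit feedback, aiming to improve performance across multiple tasks when they are similar under some natural similarity measure.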

Meta-Learning Adversarial Bandit Algorithms

We design differentially private algorithms for the problem of online linear
optimization in the full information and bandit settings with optimal
$\tilde{O}(\sqrt{T})$ regret bounds. In the full-information setting, our
results demonstrate that $\epsilon$-differential privacy may be ensured for
free -- in particular, the regret bounds scale as
$O(\sqrt{T})+\tilde{O}\left(\frac{1}{\epsilon}\right)$. For bandit linear
optimization, and as a special case, for non-stochastic multi-armed bandits,
the proposed algorithm achieves a regret of
$\tilde{O}\left(\frac{1}{\epsilon}\sqrt{T}\right)$, while the previously known
best regret bound was
$\tilde{O}\left(\frac{1}{\epsilon}T^{\frac{2}{3}}\right)$.
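
To get a feel for the size of this improvement, one can compare the two bandit bounds directly. This is only indicative, since $\tilde{O}$ hides constants and logarithmic factors:

$$
\frac{\frac{1}{\epsilon}T^{\frac{2}{3}}}{\frac{1}{\epsilon}\sqrt{T}} = T^{\frac{1}{6}}
$$

For example, at $T = 10^{6}$ the earlier bound scales as $10^{4}/\epsilon$ and the new one as $10^{3}/\epsilon$, a factor of $10$.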

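This paper proposes differentially private algorithms for online linear optimization. In the full-information setting privacy comes essentially for free, with regret $O(\sqrt{T})+\tilde{O}\left(\frac{1}{\epsilon}\right)$; for bandit linear optimization and non-stochastic multi-armed bandits, the regret is $\tilde{O}\left(\frac{1}{\epsilon}\sqrt{T}\right)$.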