We study the contextual continuum bandits problem, where the learner
sequentially receives a side information vector and has to choose an action in
a convex set, minimizing a function associated with the context. The goal is to
minimize all the underlying functions for the received contexts, leading to a
dynamic (contextual) notion of regret, which is stronger than the standard
static regret. Assuming that the objective functions are H\"older with respect
to the contexts, we demonstrate that any algorithm achieving a sub-linear
static regret can be extended to achieve a sub-linear dynamic regret. We
further study the case of strongly convex and smooth functions when the
observations are noisy. Inspired by the interior point method and employing
self-concordant barriers, we propose an algorithm achieving a sub-linear
dynamic regret. Lastly, we present a minimax lower bound, implying two key
facts. First, no algorithm can achieve sub-linear dynamic regret over functions
that are not continuous with respect to the context. Second, for strongly
convex and smooth functions, the algorithm that we propose achieves, up to a
logarithmic factor, the minimax optimal rate of dynamic regret as a function of
the number of queries.
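
For concreteness, the two notions of regret can be written as follows. The notation is an assumption made here for illustration and does not appear in the abstract: $c_t$ denotes the context received at round $t$, $z_t$ the action chosen in the convex set $\Theta$, and $f(\cdot, c_t)$ the function associated with that context. With noisy observations, these quantities would typically be taken in expectation.

\[
R_T^{\mathrm{static}} = \sum_{t=1}^{T} f(z_t, c_t) - \min_{z \in \Theta} \sum_{t=1}^{T} f(z, c_t),
\qquad
R_T^{\mathrm{dynamic}} = \sum_{t=1}^{T} f(z_t, c_t) - \sum_{t=1}^{T} \min_{z \in \Theta} f(z, c_t).
\]

Since $\sum_{t} \min_{z \in \Theta} f(z, c_t) \le \min_{z \in \Theta} \sum_{t} f(z, c_t)$, it always holds that $R_T^{\mathrm{dynamic}} \ge R_T^{\mathrm{static}}$, which is the sense in which the dynamic notion is stronger.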

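We study the contextual continuum bandits problem and show that any algorithm achieving sub-linear static regret can be extended to achieve sub-linear dynamic regret. We propose an algorithm, based on self-concordant barriers and the interior point method, that achieves sub-linear dynamic regret, and we derive two key facts. First, no algorithm can achieve sub-linear dynamic regret over functions that are not continuous with respect to the context. Second, for strongly convex and smooth functions, the proposed algorithm achieves the minimax optimal rate of dynamic regret up to a logarithmic factor.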

Contextual Continuum Bandits: Static Versus Dynamic Regret

In this paper we establish efficient and \emph{uncoupled} learning dynamics
so that, when employed by all players in a general-sum multiplayer game, the
\emph{swap regret} of each player after $T$ repetitions of the game is bounded
by $O(\log T)$, improving over the prior best bound of $O(\log^4 T)$. At the
same time, we guarantee optimal $O(\sqrt{T})$ swap regret in the adversarial
regime as well. To obtain these results, our primary contribution is to show
that when all players follow our dynamics with a \emph{time-invariant} learning
rate, the \emph{second-order path lengths} of the dynamics up to time $T$ are
bounded by $O(\log T)$, a fundamental property which could have further
implications beyond near-optimally bounding the (swap) regret. Our proposed
learning dynamics combine in a novel way \emph{optimistic} regularized learning
with the use of \emph{self-concordant barriers}. Further, our analysis is
remarkably simple, bypassing the cumbersome framework of higher-order
smoothness recently developed by Daskalakis, Fishelson, and Golowich
(NeurIPS'21).
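
For reference, swap regret can be written as follows. The notation is assumed here for illustration and is not taken from the abstract: player $i$ has a finite action set $\mathcal{A}_i$, plays the mixed strategy $x_i^{(t)} \in \Delta(\mathcal{A}_i)$ at round $t$, and observes the utility vector $u_i^{(t)}$.

\[
\mathrm{SwapReg}_i^T = \max_{\phi : \mathcal{A}_i \to \mathcal{A}_i} \sum_{t=1}^{T} \sum_{a \in \mathcal{A}_i} x_i^{(t)}(a) \bigl( u_i^{(t)}(\phi(a)) - u_i^{(t)}(a) \bigr).
\]

Restricting $\phi$ to constant maps recovers the usual external regret, so a bound on swap regret also bounds external regret.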

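This paper combines, in a novel way, optimistic regularized learning with self-concordant barriers under a time-invariant learning rate. The resulting learning dynamics for general-sum multiplayer games bound the swap regret of each player after $T$ repetitions of the game by $O(\log T)$, while guaranteeing optimal $O(\sqrt{T})$ swap regret in the adversarial regime.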