We derive a new analysis of Follow The Regularized Leader (FTRL) for online
learning with delayed bandit feedback. By separating the cost of delayed
feedback from that of bandit feedback, our analysis allows us to obtain new
results in three important settings. On the one hand, we derive the first
optimal (up to logarithmic factors) regret bounds for combinatorial
semi-bandits with delay and adversarial Markov decision processes with delay
(and known transition functions). On the other hand, we use our analysis to
derive an efficient algorithm for linear bandits with delay achieving
near-optimal regret bounds. Our novel regret decomposition shows that FTRL
remains stable across multiple rounds under mild assumptions on the Hessian of
the regularizer.
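
To make the setting concrete, the following is a minimal sketch of FTRL with delayed bandit feedback in the simplest special case, a K-armed bandit with a fixed delay. It is not the algorithm analyzed in the paper: the negative-entropy regularizer, the importance-weighted loss estimator, the fixed delay, and the name `delayed_ftrl_bandit` are all assumptions chosen for illustration.

```python
import math
import random


def ftrl_probs(cum_loss_est, eta):
    """FTRL over the simplex with a negative-entropy regularizer.

    The minimizer of <L, p> + (1/eta) * sum_i p_i log p_i has the closed
    form p_i proportional to exp(-eta * L_i).
    """
    m = min(cum_loss_est)  # shift for numerical stability
    weights = [math.exp(-eta * (l - m)) for l in cum_loss_est]
    total = sum(weights)
    return [w / total for w in weights]


def delayed_ftrl_bandit(losses, delay, eta, seed=0):
    """Play one arm per round; the loss of round t arrives at the end of round t + delay."""
    rng = random.Random(seed)
    k = len(losses[0])
    cum_est = [0.0] * k
    pending = []  # entries: (arrival_round, arm, prob_when_played, observed_loss)
    total_loss = 0.0
    for t, loss in enumerate(losses):
        p = ftrl_probs(cum_est, eta)
        arm = rng.choices(range(k), weights=p)[0]
        total_loss += loss[arm]
        pending.append((t + delay, arm, p[arm], loss[arm]))
        # Incorporate only the feedback that has arrived by now.
        still_pending = []
        for arrival, a, prob, observed in pending:
            if arrival <= t:
                cum_est[a] += observed / prob  # importance-weighted estimate
            else:
                still_pending.append((arrival, a, prob, observed))
        pending = still_pending
    return total_loss, ftrl_probs(cum_est, eta)
```

With `delay=0` this reduces to an EXP3-style update. Larger delays mean each decision is based on older estimates, and the stability of FTRL across those intermediate rounds is what the abstract's regret decomposition addresses.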

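This paper presents a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, it obtains new results in three settings: combinatorial semi-bandits with delay, adversarial Markov decision processes with delay, and linear bandits with delay. The novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer, and the analysis yields an efficient algorithm for linear bandits with near-optimal regret bounds.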

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

In this paper we investigate the Follow the Regularized Leader dynamics in
sequential imperfect information games (IIG). We generalize existing results of
Poincaré recurrence from normal-form games to zero-sum two-player imperfect
information games and other sequential game settings. We then investigate how
adapting the reward (by adding a regularization term) of the game can give
strong convergence guarantees in monotone games. We continue by showing how
this reward adaptation technique can be leveraged to build algorithms that
converge exactly to the Nash equilibrium. Finally, we show how these insights
can be directly used to build state-of-the-art model-free algorithms for
zero-sum two-player IIGs.
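
The contrast between recurrence and convergence can be sketched on matching pennies, a normal-form zero-sum game rather than a sequential one. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it uses an Euler discretization of entropy-regularized FTRL dynamics, a penalty of the form `-eta * log(policy)` with a uniform reference policy, and hypothetical names throughout. In general such a penalty leads to the equilibrium of the regularized game; here that coincides with the Nash equilibrium (0.5, 0.5) by symmetry.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def ftrl_matching_pennies(eta, steps=2000, dt=0.05, x0=1.0, z0=0.0):
    """Euler-discretized FTRL dynamics on matching pennies.

    x and z are the differences of cumulative (adapted) rewards between
    Heads and Tails for players 1 and 2; the entropy regularizer maps them
    to policies p = sigmoid(x), q = sigmoid(z).
    eta = 0 gives plain FTRL; eta > 0 subtracts eta * log(policy) from
    each action's reward, which contributes -eta * x (resp. -eta * z).
    """
    x, z = x0, z0
    for _ in range(steps):
        p, q = sigmoid(x), sigmoid(z)
        # Reward advantage of Heads over Tails for each player.
        adv1 = 2.0 * (2.0 * q - 1.0) - eta * x
        adv2 = -2.0 * (2.0 * p - 1.0) - eta * z
        x += dt * adv1
        z += dt * adv2
    return sigmoid(x), sigmoid(z)


def distance_to_nash(policy_pair):
    p, q = policy_pair
    return math.hypot(p - 0.5, q - 0.5)
```

Calling `distance_to_nash(ftrl_matching_pennies(eta=0.0))` and `distance_to_nash(ftrl_matching_pennies(eta=0.5))` shows the difference: without the penalty the policies keep orbiting the equilibrium, while with it they spiral in.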

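This paper studies Follow the Regularized Leader dynamics in sequential imperfect information games, generalizes Poincaré recurrence results to this setting, and examines how adapting the reward yields convergence guarantees. It then builds algorithms that converge exactly to the Nash equilibrium and offers new ideas for model-free algorithms for zero-sum two-player imperfect information games.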

From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization

We design and analyze algorithms for online linear optimization that have
optimal regret and at the same time do not need to know any upper or lower
bounds on the norm of the loss vectors. Our algorithms are instances of the
Follow the Regularized Leader (FTRL) and Mirror Descent (MD) meta-algorithms.
We achieve adaptiveness to the norms of the loss vectors by scale invariance,
i.e., our algorithms make exactly the same decisions if the sequence of loss
vectors is multiplied by any positive constant. The algorithm based on FTRL
works for any decision set, bounded or unbounded. For unbounded decision sets,
this is the first adaptive algorithm for online linear optimization with a
non-vacuous regret bound. In contrast, we show lower bounds on scale-free
algorithms based on MD on unbounded domains.
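
Scale invariance can be checked directly on a small example. The sketch below assumes a Euclidean instance on the unbounded domain R^d, with the regularizer (1/2)||w||^2 scaled by the square root of the sum of squared loss norms seen so far. This is a simplified illustration and not necessarily the exact algorithm of the paper; the function name is hypothetical.

```python
import math


def scale_free_ftrl(loss_vectors):
    """Return the FTRL decisions w_1, ..., w_T on R^d.

    w_{t+1} minimizes <L_t, w> + sqrt(S_t) * 0.5 * ||w||^2, where L_t is the
    sum of past loss vectors and S_t the sum of their squared norms, so
    w_{t+1} = -L_t / sqrt(S_t). Multiplying all losses by c > 0 scales
    numerator and denominator by c, leaving every decision unchanged.
    """
    dim = len(loss_vectors[0])
    cum_loss = [0.0] * dim
    sum_sq_norms = 0.0
    decisions = []
    for loss in loss_vectors:
        if sum_sq_norms > 0.0:
            root = math.sqrt(sum_sq_norms)
            w = [-g / root for g in cum_loss]
        else:
            w = [0.0] * dim  # no information yet: play the origin
        decisions.append(w)
        cum_loss = [g + l for g, l in zip(cum_loss, loss)]
        sum_sq_norms += sum(l * l for l in loss)
    return decisions
```

Running it on a loss sequence and on the same sequence multiplied by any positive constant gives the same decisions up to floating-point rounding, which is the property the abstract calls scale invariance.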

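This paper designs and analyzes online linear optimization algorithms that need no upper or lower bounds on the norm of the loss vectors. The algorithms are instances of the FTRL and MD meta-algorithms, achieve optimal regret, and adapt to the loss-vector norms through scale invariance. For unbounded decision sets, the paper gives an adaptive algorithm with a non-vacuous regret bound, and it studies lower bounds for scale-free MD-based algorithms on unbounded domains.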