We investigate bandit convex optimization (BCO) with delayed feedback, where
only the loss value of the action is revealed under an arbitrary delay.
Previous studies have established a regret bound of $O(T^{3/4}+d^{1/3}T^{2/3})$
for this problem, where $d$ is the maximum delay, by simply feeding delayed
loss values to the classical bandit gradient descent (BGD) algorithm. In this
paper, we develop a novel algorithm that improves the regret by carefully
exploiting the delayed bandit feedback via a blocking update mechanism. Our
analysis first reveals that the proposed algorithm can decouple the joint
effect of the delays and bandit feedback on the regret, and improve the regret
bound to $O(T^{3/4}+\sqrt{dT})$ for convex functions. Compared with the
previous result, our regret matches the $O(T^{3/4})$ regret of BGD in the
non-delayed setting for a larger amount of delay, i.e., $d=O(\sqrt{T})$,
instead of $d=O(T^{1/4})$. Furthermore, we consider the case with strongly
convex functions, and prove that the proposed algorithm can enjoy a better
regret bound of $O(T^{2/3}\log^{1/3}T+d\log T)$. Finally, we show that in a
special case with unconstrained action sets, it can be simply extended to
achieve a regret bound of $O(\sqrt{T\log T}+d\log T)$ for strongly convex and
smooth functions.
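
For orientation, the baseline that the abstract contrasts against (feeding delayed loss values directly to BGD) can be sketched as follows. This is a minimal illustration under stated assumptions, not the blocking algorithm proposed in the paper: the action set is taken to be a Euclidean ball, the gradient estimate is the standard one-point estimator, and the function name `delayed_bgd` and the parameters `eta`, `delta`, `radius`, and `seed` are hypothetical. The delay convention here (feedback from round `t` arrives at the end of round `t + delays[t]`) is also an assumption.

```python
import numpy as np

def delayed_bgd(loss_fns, delays, dim, eta, delta, radius=1.0, seed=0):
    """Baseline sketch: one-point bandit gradient descent fed with delayed loss values.

    Assumptions (not from the source): ball-shaped action set of the given radius,
    fixed step size eta, fixed exploration radius delta < radius.
    """
    rng = np.random.default_rng(seed)
    T = len(loss_fns)
    y = np.zeros(dim)        # center iterate, kept inside the shrunk ball
    pending = {}             # arrival round -> list of (loss value, direction)
    played = []
    for t in range(T):
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)           # uniform direction on the unit sphere
        x = y + delta * u                # perturbed action that is actually played
        played.append(x)
        value = loss_fns[t](x)           # only this scalar is revealed, and only later
        pending.setdefault(t + delays[t], []).append((value, u))
        # Process every loss value that arrives at the end of round t.
        for v, direction in pending.pop(t, []):
            grad_est = (dim / delta) * v * direction   # one-point gradient estimate
            y = y - eta * grad_est
            # Project back onto the ball of radius (radius - delta),
            # so that the next perturbed action stays feasible.
            limit = radius - delta
            norm = np.linalg.norm(y)
            if norm > limit:
                y *= limit / norm
    return np.array(played)
```

Calling `delayed_bgd(fns, delays, dim=2, eta=0.01, delta=0.1)` with a list of scalar-valued loss functions returns the sequence of played actions. Feedback whose arrival round falls beyond the horizon is simply never used in this sketch.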

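We study bandit convex optimization with delayed feedback. Using a blocking update mechanism that carefully exploits the delayed bandit feedback, our algorithm improves the regret bound and matches the regret of classical bandit gradient descent (BGD) in the non-delayed setting for a larger range of delays.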

Improved Regret in Bandit Convex Optimization with Delayed Feedback

Improved Regret for Bandit Convex Optimization with Delayed Feedback

We derive a new analysis of Follow The Regularized Leader (FTRL) for online
learning with delayed bandit feedback. By separating the cost of delayed
feedback from that of bandit feedback, our analysis allows us to obtain new
results in three important settings. On the one hand, we derive the first
optimal (up to logarithmic factors) regret bounds for combinatorial
semi-bandits with delay and adversarial Markov decision processes with delay
(and known transition functions). On the other hand, we use our analysis to
derive an efficient algorithm for linear bandits with delay achieving
near-optimal regret bounds. Our novel regret decomposition shows that FTRL
remains stable across multiple rounds under mild assumptions on the Hessian of
the regularizer.
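
To make the setting concrete, here is a minimal sketch of FTRL with delayed bandit feedback in the simplest special case, a multi-armed bandit with a negative-entropy regularizer, where the FTRL step has a closed-form softmax solution. It is not the paper's algorithm for semi-bandits, linear bandits, or MDPs. The function name `delayed_ftrl_simplex`, the learning rate `eta`, and the delay convention (feedback from round `t` arrives at the end of round `t + delays[t]`) are assumptions made for illustration.

```python
import numpy as np

def delayed_ftrl_simplex(losses, delays, eta, seed=0):
    """Sketch: FTRL over the simplex (negative entropy) with delayed bandit feedback.

    losses: array of shape (T, K) with entries in [0, 1]; only the played arm's
    loss is used, and only after its delay has elapsed.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_est = np.zeros(K)    # sum of loss estimates that have arrived so far
    pending = {}             # arrival round -> list of (arm, importance-weighted loss)
    total = 0.0
    for t in range(T):
        # Closed-form FTRL solution for negative entropy: softmax of -eta * cum_est.
        # Subtracting the minimum keeps the exponent non-positive (numerical safety).
        p = np.exp(-eta * (cum_est - cum_est.min()))
        p /= p.sum()
        arm = rng.choice(K, p=p)
        loss = losses[t, arm]
        total += loss
        # The estimate is weighted by the probability in force when the arm was played.
        pending.setdefault(t + delays[t], []).append((arm, loss / p[arm]))
        for a, est in pending.pop(t, []):
            cum_est[a] += est
    return total, p
```

The function returns the learner's cumulative loss and the final sampling distribution. Note that the distribution used at round `t` is built only from estimates that have already arrived, which is where the cost of delay enters.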

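This paper presents a new analysis of Follow The Regularized Leader (FTRL) for online learning with delayed bandit feedback. By separating the cost of delayed feedback from that of bandit feedback, it obtains new results in three settings: combinatorial semi-bandits with delay, adversarial Markov decision processes with delay, and linear bandits with delay. The novel regret decomposition shows that FTRL remains stable across multiple rounds under mild assumptions on the Hessian of the regularizer, and the analysis yields an efficient algorithm with near-optimal regret bounds for linear bandits.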

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs

We consider regret minimization for Adversarial Markov Decision Processes
(AMDPs), where the loss functions are changing over time and adversarially
chosen, and the learner only observes the losses for the visited state-action
pairs (i.e., bandit feedback). While there has been a surge of studies on this
problem using Online-Mirror-Descent (OMD) methods, very little is known about
the Follow-the-Perturbed-Leader (FTPL) methods, which are usually
computationally more efficient and also easier to implement since they only
require solving an offline planning problem. Motivated by this, we take a
closer look at FTPL for learning AMDPs, starting from the standard episodic
finite-horizon setting. We find some unique and intriguing difficulties in the
analysis and propose a workaround to eventually show that FTPL is also able to
achieve near-optimal regret bounds in this case. More importantly, we then find
two significant applications: First, the analysis of FTPL turns out to be
readily generalizable to delayed bandit feedback with order-optimal regret,
while OMD methods exhibit extra difficulties (Jin et al., 2022). Second, using
FTPL, we also develop the first no-regret algorithm for learning communicating
AMDPs in the infinite-horizon setting with bandit feedback and stochastic
transitions. Our algorithm is efficient assuming access to an offline planning
oracle, while even for the easier full-information setting, the only existing
algorithm (Chandrasekaran and Tewari, 2021) is computationally inefficient.
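
The remark that FTPL only requires solving an offline planning problem can be illustrated with a simplified sketch. It assumes an episodic finite-horizon MDP with known transitions and a table of cumulative loss estimates, and it omits the construction of those estimates from bandit feedback. The function names `plan` and `ftpl_policy`, the exponential perturbation, and the parameter `eta` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def plan(P, loss):
    """Offline planner (backward induction) for a finite-horizon MDP.

    P:    transitions, shape (H, S, A, S), P[h, s, a, s'] = Pr(s' | s, a) at step h.
    loss: per-step losses, shape (H, S, A).
    Returns a deterministic policy of shape (H, S) and the optimal values at step 0.
    """
    H, S, A = loss.shape
    value = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        q = loss[h] + P[h] @ value       # expected loss-to-go, shape (S, A)
        policy[h] = q.argmin(axis=1)
        value = q.min(axis=1)
    return policy, value

def ftpl_policy(P, cum_loss, eta, rng):
    """One FTPL step: perturb the cumulative losses, then call the planner once."""
    # Assumed perturbation: i.i.d. exponential noise with scale 1 / eta.
    noise = rng.exponential(scale=1.0 / eta, size=cum_loss.shape)
    return plan(P, cum_loss - noise)[0]
```

Each episode then costs one call to `plan`, in contrast to OMD-style updates that optimize over occupancy measures. A caller would maintain `cum_loss` across episodes and pass a `numpy.random.Generator` as `rng`.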

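By studying the Follow-the-Perturbed-Leader (FTPL) algorithm for Adversarial Markov Decision Processes (AMDPs), the authors show that it achieves near-optimal regret bounds in the episodic finite-horizon setting, extends to delayed bandit feedback with order-optimal regret, and yields the first no-regret algorithm for learning communicating AMDPs in the infinite-horizon setting with bandit feedback and stochastic transitions.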