We study a challenging form of Smoothed Online Convex Optimization (SOCO)
that includes multi-step nonlinear switching costs and feedback delay. We
propose a novel machine learning (ML) augmented online algorithm,
Robustness-Constrained Learning (RCL), which combines untrusted ML predictions
with a trusted expert online algorithm via constrained projection to robustify
the ML prediction. Specifically, we prove that RCL is able to
guarantee $(1+\lambda)$-competitiveness against any given expert for
any $\lambda>0$, while also explicitly training the ML model in a
robustification-aware manner to improve the average-case performance.
Importantly, RCL is the first ML-augmented algorithm with a provable robustness
guarantee in the case of multi-step switching cost and feedback delay. We
demonstrate the improvement of RCL in both robustness and average performance
using battery management for electrifying transportation as a case study.
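
To make the idea of robustifying an ML prediction by constrained projection concrete, the following is a minimal sketch under strong simplifying assumptions that are not taken from the abstract: scalar actions, a quadratic hitting cost, a single-step switching cost $|x_t - x_{t-1}|$, and no feedback delay. The names `robustify` and `run`, and the reservation term $|x_t - x^{\pi}_t|$ (the distance to the expert's action, added so that following the expert always remains feasible by the triangle inequality), are hypothetical choices for this illustration and should not be read as the constraint actually used by RCL.

```python
def robustify(x_ml, x_exp, x_prev, cost_so_far, budget, hit, iters=50):
    """Move from the expert action toward the ML action as far as the budget allows.

    Feasible points x satisfy (assumed, simplified constraint):
        cost_so_far + hit(x) + |x - x_prev| + |x - x_exp| <= budget
    The left-hand side is convex in x, so bisection on the segment works.
    """
    def total(x):
        return cost_so_far + hit(x) + abs(x - x_prev) + abs(x - x_exp)

    if total(x_ml) <= budget:
        return x_ml  # the ML prediction is already safe
    lo, hi = 0.0, 1.0  # lo: feasible fraction, hi: infeasible fraction
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(x_exp + mid * (x_ml - x_exp)) <= budget:
            lo = mid
        else:
            hi = mid
    return x_exp + lo * (x_ml - x_exp)


def run(targets, ml_actions, expert_actions, lam, x0=0.0):
    """Play one episode; return (actions, our total cost, expert total cost)."""
    cost = expert_cost = 0.0
    x_prev = e_prev = x0
    actions = []
    for y, x_ml, e in zip(targets, ml_actions, expert_actions):
        hit = lambda x, y=y: (x - y) ** 2  # assumed quadratic hitting cost
        expert_cost += hit(e) + abs(e - e_prev)
        budget = (1 + lam) * expert_cost
        x = robustify(x_ml, e, x_prev, cost, budget, hit)
        cost += hit(x) + abs(x - x_prev)
        actions.append(x)
        x_prev, e_prev = x, e
    return actions, cost, expert_cost


if __name__ == "__main__":
    targets = [1.0, 2.0, 0.5, 3.0]
    expert = [0.5, 1.5, 1.0, 2.5]
    bad_ml = [5.0, -5.0, 5.0, -5.0]  # deliberately poor predictions
    actions, cost, expert_cost = run(targets, bad_ml, expert, lam=0.5)
    print(actions, cost, (1 + 0.5) * expert_cost)
```

In this toy setting, a small $\lambda$ keeps the actions close to the expert's, while a large $\lambda$ lets the ML predictions through almost unchanged; in both cases the cumulative cost stays within $(1+\lambda)$ times the expert's cost.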

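We study a challenging form of Smoothed Online Convex Optimization (SOCO) with multi-step nonlinear switching costs and feedback delay, and propose a novel machine learning (ML) augmented online algorithm, Robustness-Constrained Learning (RCL), which combines untrusted ML predictions with a trusted expert online algorithm via constrained projection to make the ML predictions more robust. Specifically, we prove that RCL guarantees $(1+\lambda)$-competitiveness against any given expert for any $\lambda>0$, while explicitly training the ML model in a robustification-aware manner to improve average performance. Importantly, RCL is the first ML-augmented algorithm with a provable robustness guarantee under multi-step switching costs and feedback delay. Using battery management for electrified transportation as a case study, we demonstrate RCL's improvements in both robustness and average performance.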

Robust Learning for Smoothed Online Convex Optimization with Feedback Delay

Sequential decision-making under uncertainty is often associated with long
feedback delays. Such delays degrade the performance of the learning agent in
identifying a subset of arms with the optimal collective reward in the long
run. This problem becomes significantly more challenging in a non-stationary
environment with structural dependencies amongst the reward distributions
associated with the arms. Therefore, besides adapting to delays and
environmental changes, learning the causal relations alleviates the adverse
effects of feedback delay on the decision-making process. We formalize the
described setting as a non-stationary and delayed combinatorial semi-bandit
problem with causally related rewards. We model the causal relations by a
directed graph in a stationary structural equation model. The agent maximizes
the long-term average payoff, defined as a linear function of the base arms'
rewards. We develop a policy that learns the structural dependencies from
delayed feedback and uses them to optimize decision-making while adapting to
drifts. We prove a regret bound for the proposed algorithm. We also evaluate
our method numerically on synthetic and real-world datasets, using it to
detect the regions that contribute the most to the spread of Covid-19 in
Italy.

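Sequential decision-making in a non-stationary environment suffers from feedback delay, and learning the causal relations mitigates the adverse effects on the decision-making process. This paper formalizes the problem as a non-stationary and delayed combinatorial semi-bandit problem with causally related rewards, and uses numerical analysis on synthetic and real-world datasets to detect the regions that contribute the most to the spread of Covid-19 in Italy.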

Non-stationary Delayed Combinatorial Semi-Bandit with Causally Related Rewards

We address the problem of learning in an online setting where the learner
repeatedly observes features, selects among a set of actions, and receives
reward for the action taken. We provide the first efficient algorithm with an
optimal regret. Our algorithm uses a cost-sensitive classification learner as
an oracle and has a running time of $\mathrm{polylog}(N)$, where $N$ is the number
of classification rules among which the oracle might choose. This is
exponentially faster than all previous algorithms that achieve optimal regret
in this setting. Our formulation also enables us to create an algorithm whose
regret is additive in the feedback delay, rather than multiplicative as in all
previous work.

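This paper presents an online learning algorithm that uses a cost-sensitive classifier as an oracle and achieves optimal regret. It is exponentially faster than previous algorithms, and its regret is additive rather than multiplicative in the feedback delay.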