This paper studies multi-stage systems with end-to-end bandit feedback. In
such systems, each job needs to go through multiple stages, each managed by a
different agent, before generating an outcome. Each agent controls only its
own action and observes only the final outcome of the job. It has neither
knowledge of nor control over the actions taken by agents in the next stage.
The goal of this paper is
to develop distributed online learning algorithms that achieve sublinear regret
in adversarial environments.
The setting of this paper significantly expands the traditional multi-armed
bandit problem, which considers only one agent and one stage. In addition to
the exploration-exploitation dilemma in the traditional multi-armed bandit
problem, we show that the consideration of multiple stages introduces a third
component, education, where an agent needs to choose its actions to facilitate
the learning of agents in the next stage. To solve this newly introduced
exploration-exploitation-education trilemma, we propose a simple distributed
online learning algorithm, $\epsilon$-EXP3. We theoretically prove that
$\epsilon$-EXP3 is a no-regret policy, that is, its regret grows sublinearly
in the time horizon. Simulation results show that $\epsilon$-EXP3 significantly
outperforms existing no-regret online learning algorithms for the traditional
multi-armed bandit problem.
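
To make the setting concrete, the following is a minimal sketch of how two agents in consecutive stages might each run an EXP3-style learner on the shared end-to-end outcome. The abstract does not specify the update rule of $\epsilon$-EXP3, so everything here is an assumption made for illustration: the class `EpsilonExp3Agent`, the function `run_two_stage`, the parameters `eta` and `epsilon`, the choice to mix the exponential-weights distribution with uniform play with probability `epsilon`, and the simplification that the second-stage agent does not observe the first-stage action. Rewards are assumed to lie in [0, 1].

```python
import math
import random


class EpsilonExp3Agent:
    """Hypothetical EXP3-style learner with forced uniform play (not the paper's exact rule)."""

    def __init__(self, n_actions, eta, epsilon, rng=None):
        self.n_actions = n_actions
        self.eta = eta              # learning rate (assumed parameter)
        self.epsilon = epsilon      # probability mass spread uniformly (assumed parameter)
        self.log_weights = [0.0] * n_actions
        self.rng = rng if rng is not None else random.Random()

    def probabilities(self):
        # Exponential weights, computed in log space to avoid overflow.
        m = max(self.log_weights)
        exp_w = [math.exp(w - m) for w in self.log_weights]
        total = sum(exp_w)
        # Mixing with the uniform distribution keeps every action's probability
        # at least epsilon / n_actions.
        return [(1 - self.epsilon) * w / total + self.epsilon / self.n_actions
                for w in exp_w]

    def act(self):
        probs = self.probabilities()
        action = self.rng.choices(range(self.n_actions), weights=probs)[0]
        return action, probs[action]

    def update(self, action, prob, reward):
        # Importance-weighted reward estimate for the action actually played.
        self.log_weights[action] += self.eta * reward / prob


def run_two_stage(horizon, reward_fn, n_actions=2, eta=0.05, epsilon=0.1, seed=0):
    """Simulate two stages; both agents see only the final outcome of each job."""
    rng = random.Random(seed)
    stage1 = EpsilonExp3Agent(n_actions, eta, epsilon, rng)
    stage2 = EpsilonExp3Agent(n_actions, eta, epsilon, rng)
    total = 0.0
    for _ in range(horizon):
        a1, p1 = stage1.act()
        a2, p2 = stage2.act()
        r = reward_fn(a1, a2)       # end-to-end bandit feedback
        stage1.update(a1, p1, r)
        stage2.update(a2, p2, r)
        total += r
    return total
```

For example, `run_two_stage(1000, lambda a, b: 1.0 if (a, b) == (1, 0) else 0.0)` returns the cumulative reward when only one joint action pays off. In this sketch the uniform mixing in the first stage guarantees that the second-stage agent keeps receiving jobs under every first-stage action, which is one plausible reading of the education component described above.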

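This paper studies multi-stage systems with end-to-end bandit feedback and proposes distributed online learning algorithms that achieve sublinear regret in adversarial environments.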