Online task scheduling plays an integral role in task-intensive
applications in cloud computing and crowdsourcing. Optimal scheduling can
enhance system performance, typically measured by the reward-to-cost ratio,
under some task arrival distribution. On the one hand, both reward and cost
depend on task context (e.g., the evaluation metric) and remain black boxes in
practice, which makes them hard to model and therefore unknown before a
decision is made. On the other hand, task arrivals are sensitive to factors
such as unpredictable system fluctuations, so a prior estimate or the
conventional assumption about the arrival distribution (e.g., Poisson) may
fail. This poses another practical yet often neglected challenge: an uncertain
task arrival distribution. To schedule effectively in a stationary environment
under these uncertainties, we propose a double-optimistic learning based
Robbins-Monro (DOL-RM) algorithm. Specifically, DOL-RM integrates a
learning module that incorporates optimistic estimation for reward-to-cost
ratio and a decision module that utilizes the Robbins-Monro method to
implicitly learn task arrival distribution while making scheduling decisions.
Theoretically, DOL-RM comes with a convergence-gap guarantee and achieves
no-regret learning with a sub-linear regret of $O(T^{3/4})$, which is the first
such result for online task scheduling under an uncertain task arrival
distribution and unknown reward and cost. Numerical results from a synthetic
experiment and a real-world application show that DOL-RM achieves the best
cumulative reward-to-cost ratio among the state-of-the-art baselines compared.

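Proposes a double-optimistic learning based Robbins-Monro algorithm to address uncertain task arrival distributions and unknown reward and cost in online task scheduling. By using optimistic estimates of the reward-to-cost ratio and the Robbins-Monro method to implicitly learn the task arrival distribution during decision making, DOL-RM schedules effectively under various uncertainties and achieves a better cumulative reward-to-cost ratio than other state-of-the-art baselines.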
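
The learning module above relies on optimistic estimates of the reward-to-cost ratio. As a rough illustration only, and not the DOL-RM estimator itself, the following Python sketch assumes a UCB-style confidence bonus: the reward is estimated optimistically (mean plus bonus), the cost is estimated optimistically in the other direction (mean minus bonus), and the two are divided. The function name `optimistic_ratio`, the bonus formula, and the `cost_floor` parameter are all assumptions introduced for this example; the abstract does not specify them.

```python
import math


def optimistic_ratio(reward_sum, cost_sum, pulls, t, cost_floor=1e-3):
    """Optimistic estimate of reward/cost for one scheduling action.

    reward_sum, cost_sum: totals observed so far for this action (bandit feedback).
    pulls: number of times the action has been chosen.
    t: current round, used to scale the confidence bonus.
    cost_floor: hypothetical lower clamp that keeps the denominator positive.
    """
    if pulls == 0:
        # An untried action is treated as maximally promising.
        return math.inf
    # Assumed UCB-style bonus; it shrinks as the action is tried more often.
    bonus = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    reward_ucb = reward_sum / pulls + bonus                 # optimistic: higher reward
    cost_lcb = max(cost_sum / pulls - bonus, cost_floor)    # optimistic: lower cost
    return reward_ucb / cost_lcb
```

With few observations the estimate sits well above the empirical ratio, which encourages exploration; as `pulls` grows the bonus vanishes and the estimate approaches `reward_sum / cost_sum`.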

Learning to Schedule Online Tasks with Bandit Feedback

We propose a new method to improve the convergence speed of the Robbins-Monro
algorithm by introducing prior information about the target point into the
Robbins-Monro iteration. We incorporate the prior information without the
need for a (potentially wrong) regression model, which would also entail
additional constraints. We show that this prior-information Robbins-Monro
sequence converges for a wide range of prior distributions, even wrong ones:
Gaussians, weighted sums of Gaussians (e.g., a kernel density estimate), and
arbitrary bounded distribution functions greater than zero. We furthermore
analyse the sequence numerically to understand its
performance and the influence of parameters. The results demonstrate that the
prior-information Robbins-Monro sequence converges faster than the standard
one, especially during the first steps, which matter most in applications
where the number of function measurements is limited, and when the noise in
observing the underlying function is large. Finally, we propose a rule for
selecting the parameters of the sequence.

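Proposes a new method that uses prior information to improve the convergence speed of the Robbins-Monro algorithm without requiring a potentially wrong regression model, and that works for a wide range of prior distributions. The prior-information Robbins-Monro sequence converges faster than the standard sequence, particularly in applications where the number of function measurements is limited and the observation noise is large. A rule for selecting the sequence parameters is also proposed.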
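
For orientation, the standard Robbins-Monro sequence that the abstract uses as its baseline can be sketched as follows. This is the classical iteration only, not the prior-information variant, whose update rule the abstract does not give. The step-size schedule `a / n`, the function name `robbins_monro`, and the noisy linear test function are assumptions chosen for illustration.

```python
import random


def robbins_monro(noisy_f, x0, steps, a=1.0, target=0.0):
    """Classical Robbins-Monro iteration for solving E[noisy_f(x)] = target.

    noisy_f: callable returning a noisy measurement of the underlying function.
    x0: starting point.
    steps: number of function measurements to spend.
    a: scale of the assumed step-size schedule a / n.
    """
    x = x0
    for n in range(1, steps + 1):
        # Move against the observed deviation from the target with a shrinking step.
        x -= (a / n) * (noisy_f(x) - target)
    return x


# Hypothetical usage: the underlying function 2 * (x - 3) has its root at x = 3,
# and each measurement is corrupted by Gaussian noise.
rng = random.Random(0)
estimate = robbins_monro(lambda x: 2.0 * (x - 3.0) + rng.gauss(0.0, 0.5),
                         x0=0.0, steps=5000)
```

Because the step size decays as `a / n`, the early iterates move the most, which is why the first steps dominate the error when only a few measurements are available.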