We propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedback. The delay in feedback increases the effective sample complexity of standard algorithms, but can be offset if we have access to partial feedback received before a pull is completed. We propose a general framework to model the relationship between partial and delayed feedback, and as a special case we introduce efficient algorithms for settings where the partial feedback are biased or unbiased estimators of the delayed feedback. Additionally, we propose a novel extension of the algorithms to the parallel MAB setting where an agent can control a batch of arms. Our experiments in real-world settings, involving policy search and hyperparameter optimization in computational sustainability domains for fast charging of batteries and wildlife corridor construction, demonstrate that exploiting the structure of partial feedback can lead to significant improvements over baselines in both sequential and parallel MAB.

本文研究了在多臂赌博机的延迟反馈场景下，如何利用局部反馈来提高标准算法的样本复杂度。采用模型化的方法探讨了局部反馈和延迟反馈之间的关系，并提出了一种用于处理偏差或无偏差情况下局部反馈的有效算法。另外，还针对并行多臂赌博机提出了一种新的算法扩展。在实际场景中，针对电池快速充电和野生动物走廊建设的计算可持续性领域中的策略搜索和超参数优化等问题的实验表明，利用局部反馈的结构可以显著提高标准算法的性能。

多臂老虎机中带延迟反馈的最佳臂识别