In a conventional contextual multi-armed bandit problem, the feedback (or
reward) is immediately observable after an action. Nevertheless, delayed
feedback arises in numerous real-life situations and is particularly crucial in
time-sensitive applications. The exploration-exploitation dilemma becomes
especially challenging under such conditions, as it couples with the
interplay between delays and limited resources. Moreover, a limited budget often
aggravates the problem by restricting the exploration potential. A motivating
example is the distribution of medical supplies at the early stage of COVID-19.
Delayed test results left too little information for learning, which
degraded the efficiency of resource allocation. Motivated by such
applications, we study the effect of delayed feedback on constrained contextual
bandits. We develop a decision-making policy, delay-oriented resource
allocation with learning (DORAL), to optimize the resource expenditure in a
contextual multi-armed bandit problem with arm-dependent delayed feedback.

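Summary: Under limited resources and delayed feedback, this work studies the effect of delayed feedback on constrained contextual multi-armed bandit problems and develops a decision-making policy (DORAL) to optimize resource usage in contextual multi-armed bandit problems with arm-dependent delayed feedback.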

Budgeted Recommendation with Delayed Feedback

We study contextual bandits with budget and time constraints, referred to as
constrained contextual bandits. The time and budget constraints significantly
complicate the exploration and exploitation tradeoff because they introduce
complex coupling among contexts over time. Such coupling effects make it
difficult to obtain oracle solutions that assume known statistics of bandits.
To gain insight, we first study unit-cost systems with known context
distribution. When the expected rewards are known, we develop an approximation
of the oracle, referred to as Adaptive-Linear-Programming (ALP), which achieves
near-optimality and only requires the ordering of expected rewards. With these
highly desirable features, we then combine ALP with the upper-confidence-bound
(UCB) method in the general case where the expected rewards are unknown a
priori. We show that the proposed UCB-ALP algorithm achieves logarithmic
regret except for certain boundary cases. Further, we design algorithms and
obtain similar regret analysis results for more general systems with unknown
context distribution and heterogeneous costs. To the best of our knowledge,
this is the first work that shows how to achieve logarithmic regret in
constrained contextual bandits. Moreover, this work also sheds light on the
study of computationally efficient algorithms for general constrained
contextual bandits.

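Summary: This paper studies constrained contextual bandits with budget and time constraints and proposes an efficient algorithm, UCB-ALP, that approximately solves the problem and achieves logarithmic regret.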
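
The abstract notes that, in the unit-cost setting with a known context distribution, ALP needs only the ordering of the expected rewards. The sketch below illustrates one plausible reading of that idea, not the authors' implementation. It assumes a binary take-or-skip decision per round, so the linear program reduces to a greedy fill: contexts are served in decreasing order of reward until the average remaining budget per round is used up. The function name `alp_probabilities` and all parameter names are hypothetical.

```python
def alp_probabilities(rewards, context_probs, remaining_budget, remaining_rounds):
    """Return, for each context, the probability of taking the action this round.

    Assumptions (not from the source): unit cost per action, a single
    take-or-skip decision, and a known context distribution `context_probs`.
    """
    n = len(rewards)
    if remaining_rounds <= 0 or remaining_budget <= 0:
        return [0.0] * n

    # Average budget available per remaining round, capped at one action per round.
    mass = min(1.0, remaining_budget / remaining_rounds)

    # Only the ordering of the rewards matters, not their values.
    order = sorted(range(n), key=lambda j: rewards[j], reverse=True)

    probs = [0.0] * n
    for j in order:
        if mass <= 1e-12:
            break
        if context_probs[j] <= 0:
            continue  # a context that never occurs consumes no budget
        # Serve this context fully if the budget allows, otherwise fractionally.
        probs[j] = min(1.0, mass / context_probs[j])
        mass -= probs[j] * context_probs[j]
    return probs


# Three contexts, 35 units of budget over 100 rounds (0.35 actions per round).
probs = alp_probabilities([0.9, 0.5, 0.1], [0.2, 0.3, 0.5], 35, 100)
```

Because the function is re-evaluated with the current remaining budget and remaining rounds, the threshold adapts over time. In a UCB-style variant where rewards are unknown, the `rewards` argument could be replaced by upper-confidence indices, since only their ranking is used.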