In this paper, we consider a very general model for the exploration-exploitation tradeoff which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon. This model subsumes the classic multi-armed bandit (MAB) model and the Bandits with Knapsacks (BwK) model of Badanidiyuru et al. [2013]. We also consider an extension of this model that allows linear contexts, similar to the linear contextual extension of the MAB model. We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB provides a polynomial-time algorithm with near-optimal regret guarantees for this substantially more general model, and that it matches the bounds provided by Badanidiyuru et al. [2013] for the special case of BwK, which is quite surprising. We also provide computationally more efficient algorithms by establishing interesting connections between this problem and other well-studied problems and algorithms, such as the Blackwell approachability problem, online convex optimization, and the Frank-Wolfe technique for convex optimization. We give examples of several concrete applications where this more general model of bandits allows for richer and/or more efficient formulations of the problem.
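For readers unfamiliar with the UCB family mentioned above, the following is a minimal sketch of the classic UCB1 rule for the plain MAB setting only. It is not the extended algorithm of this paper: it handles neither concave objectives nor convex constraints. The function name `ucb1`, its parameters `arm_means`, `horizon`, and `seed`, and the use of Bernoulli rewards are all assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run classic UCB1 on simulated Bernoulli arms (illustrative assumption).

    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm was pulled
    sums = [0.0] * n_arms   # total observed reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so that each empirical mean is defined.
            arm = t - 1
        else:
            # Optimism: pick the arm with the largest upper confidence
            # bound, i.e. empirical mean plus a confidence radius.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Simulated Bernoulli reward for the chosen arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


if __name__ == "__main__":
    counts, total = ucb1([0.1, 0.9], horizon=2000)
    print("pulls per arm:", counts, "total reward:", total)
```

The sketch keeps only the optimistic index computation, since that is the component the abstract says is extended to the more general model.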

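In this paper, we propose a general model of the exploration-exploitation tradeoff that allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to a limit on the time horizon. We show that a natural and simple extension of the UCB family of algorithms for MAB yields a polynomial-time algorithm with near-optimal regret guarantees, matching the bounds given by Badanidiyuru et al. for the special case of BwK, which is quite surprising. In addition, we provide more efficient algorithms by establishing interesting connections between this problem and well-studied algorithms from other areas.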

Bandits with Concave Rewards and Convex Knapsacks