We consider the contextual version of a multi-armed bandit problem with global convex constraints and concave objective function. In each round, the outcome of pulling an arm is a context-dependent vector, and the global constraints require the average of these vectors to lie in a certain convex set. The objective is a concave function of this average vector. The learning agent competes with an arbitrary set of context-dependent policies. This problem is a common generalization of problems considered by Badanidiyuru et al. (2014) and Agrawal and Devanur (2014), with important applications. We give computationally efficient algorithms with near-optimal regret, generalizing the approach of Agarwal et al. (2014) for the unconstrained version of the problem. For the special case of budget constraints, our regret bounds match those of Badanidiyuru et al. (2014), answering their main open question of obtaining a computationally efficient algorithm.
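
As a rough formalization of the setting described above (the notation is assumed here for illustration and does not come from the abstract): in round $t$ the agent observes a context $x_t$, pulls an arm $a_t$, and receives an outcome vector $v_t(a_t) \in [0,1]^d$. Writing $S$ for the convex constraint set and $f$ for the concave objective, the goal over $T$ rounds can be sketched as

\[
\bar{v}_T = \frac{1}{T}\sum_{t=1}^{T} v_t(a_t), \qquad \text{maximize } f(\bar{v}_T) \quad \text{subject to } \bar{v}_T \in S .
\]

Under this reading, performance would be judged both by how far $f(\bar{v}_T)$ falls short of what the best policy, or mixture of policies, in the given policy set could achieve, and by how far $\bar{v}_T$ lies from $S$. The budget-constrained special case would roughly correspond to a linear $f$ on a reward coordinate, with $S$ capping the average consumption in each resource coordinate.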

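This work studies the contextual multi-armed bandit problem under global knapsack constraints. It proposes an algorithm that is more computationally efficient and achieves lower regret, with complexity logarithmic in the size of the policy space, and extends the result to a variant that has no knapsack constraints but whose objective is an arbitrary Lipschitz concave function.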

An efficient algorithm for multi-armed bandits with knapsack constraints, and an extension to problems with concave objectives