This paper investigates the adversarial Bandits with Knapsack (BwK) online learning problem, where a player repeatedly chooses an action, pays the corresponding cost, and receives the reward associated with that action. The player is constrained by a maximum budget $B$ that can be spent on actions, and the rewards and costs of the actions are assigned by an adversary. This problem has previously been studied only in the restricted setting where the reward of an action is greater than its cost; we provide a solution for the general setting. Namely, we propose EXP3.BwK, a novel algorithm that achieves order-optimal regret. We also propose EXP3++.BwK, which is order-optimal in the adversarial BwK setup and incurs an almost optimal expected regret, with an additional factor of $\log(B)$, in the stochastic BwK setup. Finally, we investigate the case of large action costs (i.e., costs comparable to the budget size $B$) and show that, in the adversarial setting, the achievable regret bounds can be significantly worse than when costs are bounded by a constant, which is a common assumption in the BwK literature.
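
The interaction protocol described above can be illustrated with a minimal sketch. It is not EXP3.BwK or EXP3++.BwK: the function `run_bwk`, the callback `uniform_policy`, the fixed-in-advance cost and reward tables, and the stopping rule (the game ends once the chosen action's cost exceeds the remaining budget) are all assumptions made for illustration only.

```python
import random


def run_bwk(costs, rewards, budget, choose, seed=0):
    """Simulate a budget-limited bandit game (illustrative sketch only).

    costs[t][a] and rewards[t][a] are the cost and reward of action a in
    round t, fixed in advance here to mimic an adversary's assignment.
    choose(rng, n_actions) is a hypothetical policy callback that returns
    the index of the action to play.
    """
    rng = random.Random(seed)
    remaining = budget
    total_reward = 0.0
    rounds_played = 0
    for cost_t, reward_t in zip(costs, rewards):
        action = choose(rng, len(cost_t))
        if cost_t[action] > remaining:
            # Assumed stopping rule: stop when the action is unaffordable.
            break
        remaining -= cost_t[action]      # pay the cost of the chosen action
        total_reward += reward_t[action]  # collect the associated reward
        rounds_played += 1
    return total_reward, remaining, rounds_played


def uniform_policy(rng, n_actions):
    """Placeholder policy: pick an action uniformly at random."""
    return rng.randrange(n_actions)
```

A learning algorithm would replace `uniform_policy` with a rule that adapts to the observed costs and rewards; the budget accounting in the loop stays the same.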

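This paper studies the adversarial bandit problem under a knapsack-style budget constraint, proposes the EXP3.BwK algorithm as an online learning solution, and gives the corresponding regret bounds. It shows that when action costs are comparable to the budget size, the achievable regret bounds can be far worse than in the case where costs are bounded.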

Unifying Stochastic and Adversarial Bandits with Knapsack