We introduce a novel multi-armed bandit framework, where each arm is
associated with a fixed unknown credal set over the space of outcomes (which
can be richer than just the reward). The arm-to-credal-set correspondence comes
from a known class of hypotheses. We then define a notion of regret
corresponding to the lower prevision defined by these credal sets.
Equivalently, the setting can be regarded as a two-player zero-sum game, where,
on each round, the agent chooses an arm and the adversary chooses the
distribution over outcomes from a set of options associated with this arm. The
regret is defined with respect to the value of the game. For certain natural
hypothesis classes, loosely analogous to stochastic linear bandits (which are a
special case of the resulting setting), we propose an algorithm and prove a
corresponding upper bound on regret. We also prove lower bounds on regret for
particular special cases.
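
To make the maximin reading concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: the outcome space is finite, each credal set is given by a finite list of its extreme points, the reward is a fixed function of the outcome, and regret is taken to be the game value times the number of rounds minus the total reward received. All names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy instance; every name and number below is an illustrative assumption.
Distribution = Sequence[float]  # probabilities over a finite outcome space


def expected_reward(dist: Distribution, rewards: Sequence[float]) -> float:
    """Expected reward of one outcome distribution."""
    return sum(p * r for p, r in zip(dist, rewards))


def lower_prevision(credal_set: List[Distribution], rewards: Sequence[float]) -> float:
    """Worst-case expected reward over a credal set given by its extreme points."""
    return min(expected_reward(d, rewards) for d in credal_set)


def game_value(credal_sets: Dict[str, List[Distribution]], rewards: Sequence[float]) -> float:
    """Maximin value: the best lower prevision attainable by any single arm."""
    return max(lower_prevision(cs, rewards) for cs in credal_sets.values())


def regret(received: Sequence[float], value: float) -> float:
    """Assumed regret: game value times number of rounds, minus total reward received."""
    return value * len(received) - sum(received)


rewards = [0.0, 1.0]  # reward attached to each of two outcomes
credal_sets = {
    "a": [[0.5, 0.5], [0.25, 0.75]],  # adversary may pick either distribution
    "b": [[1.0, 0.0], [0.0, 1.0]],    # fully ambiguous arm
}
value = game_value(credal_sets, rewards)
```

In this toy instance arm "a" guarantees an expected reward of at least 0.5 whatever the adversary does, while arm "b" guarantees nothing, so the benchmark against which regret is measured is the guarantee of arm "a".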
