We introduce the safe linear stochastic bandit framework, a generalization of linear stochastic bandits in which, at each stage, the learner is required to select an arm whose expected reward is, with high probability, no less than a predetermined (safe) threshold. We assume that the learner initially knows of an arm that is safe, but not necessarily optimal. Leveraging this assumption, we introduce a learning algorithm that systematically combines known safe arms with exploratory arms to safely expand the set of safe arms over time, while facilitating safe greedy exploitation in subsequent stages. In addition to satisfying the safety constraint at every stage of play, the proposed algorithm is shown to exhibit an expected regret that is no more than $O(\sqrt{T}\log (T))$ after $T$ stages of play.
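
One way to picture how a known safe arm can be combined with an exploratory arm is a convex combination that keeps enough weight on the safe arm to absorb the worst-case loss from exploring. The sketch below is illustrative only and is not taken from the algorithm described above. It assumes that exploration directions lie in the unit ball, that the expected reward of the safe arm is known, and that a bound on the norm of the unknown parameter is available. All names (`mixing_weight`, `conservative_arm`, `theta_norm_bound`) are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def mixing_weight(safe_reward, threshold, theta_norm_bound):
    """Largest gamma in [0, 1] for which (1 - gamma) * x_safe + gamma * u
    is guaranteed safe for every u with ||u|| <= 1.

    Assumes ||theta|| <= theta_norm_bound, so |u . theta| <= theta_norm_bound.
    Worst-case reward of the mixture is
        (1 - gamma) * safe_reward - gamma * theta_norm_bound,
    and requiring this to be >= threshold gives the bound below.
    """
    margin = safe_reward - threshold
    if margin < 0:
        raise ValueError("the baseline arm must meet the safety threshold")
    denom = safe_reward + theta_norm_bound
    if denom <= 0:
        raise ValueError("safe_reward + theta_norm_bound must be positive")
    return min(1.0, margin / denom)


def conservative_arm(x_safe, direction, gamma):
    """Convex combination of the safe arm and an exploration direction."""
    return [(1.0 - gamma) * s + gamma * d for s, d in zip(x_safe, direction)]


def random_unit_vector(dim, rng):
    """Uniformly random direction on the unit sphere (hypothetical exploration rule)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]
```

With a safe arm of expected reward 0.48, a threshold of 0.3, and a norm bound of 1, the mixing weight is 0.18 / 1.48, so roughly 12% of each conservative arm can be devoted to exploration without violating the threshold in this simplified setting. A full method would additionally have to estimate the parameter from noisy rewards and enlarge the set of arms certified as safe, which this sketch does not attempt.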

Safe Linear Stochastic Bandits