We study an extension of the standard bandit problem in which there are R layers of experts. The experts make selections layer by layer, and only the experts in the last layer can play arms. The goal of the learning policy is to minimize the total regret in this hierarchical experts setting. We first analyze the case in which the total regret grows linearly with the number of layers. We then focus on the case in which all experts follow the Upper Confidence Bound (UCB) strategy and give several sublinear upper bounds on the regret under different circumstances. Finally, we design experiments to support the regret analysis for the general hierarchical UCB structure and to show the practical significance of our theoretical results. This article offers insights into how to design a reasonable hierarchical decision structure.
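
To make the setting concrete, the following is a minimal simulation sketch and not the construction analyzed in the paper. It assumes a complete tree in which every expert chooses among `branching` options, every expert runs the textbook UCB1 index, all experts on the selected path are updated with the same observed reward, and arms are Bernoulli. The names `UCB1`, `run_hierarchical_ucb`, `num_layers`, `branching`, `arm_means`, and `horizon` are hypothetical and chosen only for illustration.

```python
import math
import random


class UCB1:
    """Textbook UCB1 index policy over n options (child experts or arms)."""

    def __init__(self, n):
        self.counts = [0] * n   # number of times each option was chosen
        self.sums = [0.0] * n   # total reward credited to each option
        self.t = 0              # total number of updates

    def select(self):
        # Choose every option once before relying on the confidence index.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        return max(
            range(len(self.counts)),
            key=lambda i: self.sums[i] / self.counts[i]
            + math.sqrt(2.0 * math.log(self.t) / self.counts[i]),
        )

    def update(self, i, reward):
        self.t += 1
        self.counts[i] += 1
        self.sums[i] += reward


def run_hierarchical_ucb(num_layers, branching, arm_means, horizon, seed=0):
    """Simulate num_layers layers of UCB1 experts (assumed tree layout).

    Returns the cumulative pseudo-regret against the best single arm.
    """
    if len(arm_means) != branching ** num_layers:
        raise ValueError("expected branching ** num_layers arm means")
    rng = random.Random(seed)
    experts = {}            # path prefix (tuple of choices) -> UCB1 policy
    best = max(arm_means)
    regret = 0.0
    for _ in range(horizon):
        path, visited = (), []
        # Selections are made layer by layer; the last choice picks an arm.
        for _layer in range(num_layers):
            policy = experts.setdefault(path, UCB1(branching))
            choice = policy.select()
            visited.append((policy, choice))
            path += (choice,)
        # Map the path to a flat arm index (base-`branching` digits).
        arm = 0
        for choice in path:
            arm = arm * branching + choice
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        # Assumption: every expert on the path observes the same reward.
        for policy, choice in visited:
            policy.update(choice, reward)
        regret += best - arm_means[arm]
    return regret


if __name__ == "__main__":
    means = [0.1, 0.2, 0.3, 0.9]  # illustrative values, not from the paper
    print(run_hierarchical_ucb(2, 2, means, horizon=5000, seed=0))
```

Under these assumptions, setting `num_layers=1` recovers an ordinary UCB1 bandit, which gives a baseline against which the regret of deeper hierarchies can be compared.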

Regret Analysis for the Hierarchical Experts Bandit Problem