Motivated by the question of how to fairly share the cost of exploration among multiple groups in learning problems, we develop the Nash bargaining solution in the context of multi-armed bandits. Specifically, the 'grouped' bandit associated with any multi-armed bandit problem assigns to each time step a single group from some finite set of groups. The utility gained by a given group under some learning policy is naturally viewed as the reduction in that group's regret relative to the regret that group would have incurred 'on its own'. We derive policies that yield the Nash bargaining solution relative to the set of incremental utilities possible under any policy. We show that, on the one hand, the 'price of fairness' under such policies is limited, while on the other hand, regret-optimal policies are arbitrarily unfair under generic conditions. Our theoretical development is complemented by a case study on contextual bandits for warfarin dosing, where we are concerned with the cost of exploration across multiple races and age groups.
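
As a rough sketch of the objects described above, the utility and the bargaining objective can be written as follows. The notation is assumed here for illustration and is not taken from the source: $\mathcal{G}$ is the finite set of groups, $R^g(\pi)$ is the regret group $g$ incurs under a policy $\pi$, and $\tilde{R}^g$ is the regret group $g$ would incur on its own. The maximization uses the standard form of the Nash bargaining solution (maximizing the product, or equivalently the sum of logarithms, of utilities); the exact normalization used in the paper may differ.

```latex
% Assumed notation: \mathcal{G}, R^g(\pi), \tilde{R}^g, u^g(\pi) are illustrative symbols.
% Utility of group g: reduction in regret relative to learning on its own.
\[
  u^g(\pi) \;=\; \tilde{R}^g - R^g(\pi), \qquad g \in \mathcal{G}.
\]
% Standard Nash bargaining solution over the utilities achievable by some policy,
% restricted to policies under which every group gains (u^g(\pi) > 0).
\[
  \pi^{\mathrm{NB}} \;\in\; \arg\max_{\pi \,:\, u^g(\pi) > 0 \ \forall g}
  \;\sum_{g \in \mathcal{G}} \log u^g(\pi).
\]
```

Under this form, a policy that drives any one group's utility toward zero is heavily penalized by the logarithm, which is one way to see why a regret-optimal policy need not coincide with the bargaining solution.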

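This work studies how the cost of exploration in online learning is shared across multiple groups. It proposes a 'grouped' bandit model that uses axiomatic bargaining, and the Nash bargaining solution in particular, to formalize the division of exploration cost, and it derives policies that balance fairness against exploration cost. A contextual bandit for warfarin dosing serves as an example illustrating the relative merits of this algorithmic framework.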

Axiomatic Bargaining for Fair Exploration