We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide a frequentist analysis of this algorithm and prove that it is also nearly optimal, using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.

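This work addresses online multi-task learning in multi-armed bandit environments that are similar but not identical, and studies how robust transfer of knowledge can improve a learner's overall performance across multiple related tasks. We propose a TS-type algorithm, provide a frequentist analysis of it, and prove that it is nearly optimal. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm achieves superior empirical performance compared with a baseline algorithm and the UCB-based algorithm.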

Thompson Sampling for Robust Transfer in Multi-Task Bandits