Multi-task learning (MTL) aims to improve the performance of a primary task
by jointly learning with related auxiliary tasks. Traditional MTL methods
select tasks randomly during training. However, both previous studies and our
results suggest that such random task selection may not be helpful and
can even harm performance. Therefore, new strategies for task
selection and assignment in MTL need to be explored. This paper studies the
multi-modal, multi-task dialogue act classification task, and proposes a method
for selecting and assigning tasks based on non-stationary multi-armed bandits
(MAB) with discounted Thompson Sampling (TS) using Gaussian priors. Our
experimental results show that task utility differs across training stages.
The proposed method effectively identifies task utility, actively avoids
useless or harmful tasks, and performs task assignment during training. It
significantly outperforms the single-task and multi-task baselines in terms of
unweighted average recall (UAR) and F1 (p < 0.05). Further analysis of the
experiments indicates that, on the dataset with a data imbalance problem, the
proposed method is significantly more stable and achieves consistent and decent
performance on minority classes. It also outperforms the current
state-of-the-art model.
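
The selection mechanism named above can be illustrated with a minimal sketch of discounted Thompson Sampling with Gaussian posteriors. Everything specific here is an assumption for illustration and is not taken from the paper: the class name `DiscountedGaussianTS`, the zero-mean prior, the unit observation-noise variance, the discount factor `gamma`, and the simulated rewards (which stand in for whatever utility signal an auxiliary task provides during training).

```python
import numpy as np


class DiscountedGaussianTS:
    """Illustrative discounted Thompson Sampling with Gaussian posteriors.

    Assumptions (not from the paper): prior N(0, prior_var) on each arm's
    mean reward, unit observation-noise variance, and a discount applied
    to every arm's statistics at every step.
    """

    def __init__(self, n_arms, gamma=0.95, prior_var=1.0, seed=0):
        self.gamma = gamma
        self.prior_var = prior_var
        self.counts = np.zeros(n_arms)  # discounted number of pulls per arm
        self.sums = np.zeros(n_arms)    # discounted sum of rewards per arm
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Gaussian posterior under the assumed prior and unit noise variance.
        precision = 1.0 / self.prior_var + self.counts
        mean = self.sums / precision
        std = np.sqrt(1.0 / precision)
        # Draw one sample per arm and play the arm with the largest sample.
        return int(np.argmax(self.rng.normal(mean, std)))

    def update(self, arm, reward):
        # Discount old evidence so the posterior can track drifting utility.
        self.counts *= self.gamma
        self.sums *= self.gamma
        self.counts[arm] += 1.0
        self.sums[arm] += reward


def simulate(steps=2000, window=200, seed=0):
    """Toy non-stationary run: arm utilities flip halfway through.

    Returns how often each arm was chosen in the final `window` steps.
    """
    rng = np.random.default_rng(seed)
    bandit = DiscountedGaussianTS(n_arms=3, gamma=0.95, seed=seed)
    early = np.array([1.0, 0.0, -1.0])  # hypothetical utilities, first half
    late = np.array([-1.0, 0.0, 1.0])   # hypothetical utilities, second half
    pulls = np.zeros(3, dtype=int)
    for t in range(steps):
        means = early if t < steps // 2 else late
        arm = bandit.select()
        bandit.update(arm, rng.normal(means[arm], 0.5))
        if t >= steps - window:
            pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate())
```

In this toy setting each arm plays the role of a candidate task. Because old observations are discounted, an arm that was useful early but harmful later loses its posterior advantage, and arms that have not been tried recently drift back toward the prior and get re-explored.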

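This paper proposes a task selection and assignment method for multi-modal,
multi-task dialogue act classification, based on discounted Thompson Sampling
for non-stationary multi-armed bandits. The results show that the method
effectively identifies task utility at different training stages and actively
avoids useless or harmful tasks during training. It significantly outperforms
the single-task and multi-task baseline models in terms of UAR and F1
(p < 0.05). Further analysis of the experiments shows that, on datasets with a
data imbalance problem, the method is significantly more stable and achieves
consistent and good performance on minority classes. It also outperforms the
current state-of-the-art model.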

Task Selection and Assignment for Multi-modal Multi-task Dialogue Act Classification with Non-stationary Multi-armed Bandits

Restless bandit problems are instances of non-stationary multi-armed bandits.
These problems have been well studied from the optimization perspective, where
the goal is to efficiently find a near-optimal policy when system parameters
are known. However, very few papers adopt a learning perspective, where the
parameters are unknown. In this paper, we analyze the performance of Thompson
sampling in episodic restless bandits with unknown parameters. We consider a
general policy map to define our competitor and prove an
$\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. Our competitor is
flexible enough to represent various benchmarks including the best fixed action
policy, the optimal policy, the Whittle index policy, or the myopic policy. We
also present empirical results that support our theoretical findings.
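
One way to make the regret statement concrete, under notation assumed here rather than taken from the abstract: let the horizon $T = mL$ consist of $m$ episodes of length $L$, let $\theta^*$ be the unknown parameter drawn from a prior $Q$, let $\mu$ be the policy map that sends a parameter to a policy, and let $V^{\pi}_{\theta}$ be the expected reward of policy $\pi$ over one episode under $\theta$. If $\pi_l$ denotes the policy the learner follows in episode $l$ (for Thompson sampling, plausibly $\mu(\theta_l)$ with $\theta_l$ drawn from the current posterior), the Bayesian regret against the competitor $\mu(\theta^*)$ could be written as

$$
\mathrm{BR}(T) = \mathbb{E}_{\theta^* \sim Q}\left[ \sum_{l=1}^{m} \left( V^{\mu(\theta^*)}_{\theta^*} - V^{\pi_l}_{\theta^*} \right) \right],
$$

and the stated result reads $\mathrm{BR}(T) = \tilde{\mathcal{O}}(\sqrt{T})$. Choosing $\mu$ to return the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy gives the respective benchmarks.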

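This paper analyzes, from a learning perspective, episodic restless bandit
problems with unknown parameters. Under Thompson sampling, it considers a
general policy map as the competitor and proves an
$\tilde{\mathcal{O}}(\sqrt{T})$ Bayesian regret bound. The competitor is
flexible enough to represent various benchmarks, including the best fixed
action policy, the optimal policy, the Whittle index policy, and the myopic
policy. Empirical results that support the theoretical findings are also
provided.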