Interest in multi-task learning for video understanding has grown in recent years. In this work, we propose a generalized notion of multi-task learning that incorporates both auxiliary tasks, which the model should perform well on, and adversarial tasks, which the model should not perform well on. We employ Necessary Condition Analysis (NCA) as a data-driven approach for deciding which category each task falls into. Our proposed framework, Adversarial Multi-Task Neural Networks (AMT), penalizes adversarial tasks, which NCA determines to be scene recognition in the Holistic Video Understanding (HVU) dataset, to improve action recognition. This upends the common assumption that the model should always be encouraged to do well on all tasks in multi-task learning. At the same time, AMT is a generalization of existing methods and so retains all the benefits of multi-task learning; it uses object recognition as an auxiliary task to aid action recognition. We introduce two challenging Scene-Invariant test splits of HVU, where the model is evaluated on action-scene co-occurrences not encountered in training. We show that our approach improves accuracy by ~3% and encourages the model to attend to action features instead of correlation-biasing scene features.

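This paper proposes a generalized view of multi-task learning that introduces both auxiliary tasks, which the model should perform well on, and adversarial tasks, which it should not, and uses data-driven Necessary Condition Analysis (NCA) to decide which category each task belongs to. The proposed Adversarial Multi-Task Neural Networks (AMT) framework penalizes the task that NCA identifies as adversarial, scene recognition in the Holistic Video Understanding (HVU) dataset, to improve action recognition accuracy. While retaining all the benefits of multi-task learning, it uses object recognition as an auxiliary task to aid action recognition. We introduce two challenging Scene-Invariant test splits of HVU, which evaluate the model on action-scene co-occurrences not encountered in training. Results show that our method improves accuracy by about 3% while encouraging the model to attend to action features rather than correlation-biasing scene features.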
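The idea of encouraging auxiliary tasks while penalizing adversarial ones can be illustrated with a minimal sketch of a combined training objective. The abstract does not give the exact loss, so the signed weighted sum below, the function name `amt_loss`, and the weights `aux_weight` and `adv_weight` are all assumptions made for illustration; the actual AMT formulation may differ (for example, it might use gradient reversal rather than a negated loss term).

```python
def amt_loss(action_loss, auxiliary_losses, adversarial_losses,
             aux_weight=1.0, adv_weight=0.1):
    """Hypothetical objective for the shared feature encoder.

    action_loss        -- loss on the main task (action recognition)
    auxiliary_losses   -- losses on tasks the model should do well on
                          (e.g. object recognition)
    adversarial_losses -- losses on tasks the model should NOT do well on
                          (e.g. scene recognition)
    The weights are assumed hyperparameters, not values from the paper.
    """
    total = action_loss
    # Auxiliary tasks are added: minimizing the total also minimizes them.
    total += aux_weight * sum(auxiliary_losses)
    # Adversarial tasks are subtracted: minimizing the total pushes the
    # shared features toward being less useful for these tasks.
    total -= adv_weight * sum(adversarial_losses)
    return total


# Example with made-up loss values for one training step.
step_loss = amt_loss(action_loss=1.0,
                     auxiliary_losses=[0.5],
                     adversarial_losses=[2.0],
                     aux_weight=1.0,
                     adv_weight=0.25)
```

With an empty `adversarial_losses` list, this reduces to an ordinary weighted multi-task loss, which matches the abstract's description of AMT as a generalization of existing multi-task methods. A plain negated term is unbounded below, so a practical system would likely need a bounded or otherwise regularized variant.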

Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-Task Video Understanding