Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian...
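TL;DR: Addresses the "over-focus" problem in existing 1D causal vision models by using learnable band-pass filters to create diverse attention patterns, and by introducing a large, scheduled drop path rate and an auxiliary loss on globally pooled features, thereby improving model performance on multimodal tasks.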
Abstract
Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize →