While the Self-Attention mechanism in the Transformer model has proven to be
effective in many domains, we observe that it is less effective in more diverse
settings (e.g., multimodality) because tokens vary in granularity and long
sequences are computationally expensive. To address these challenges,
we introduce the Learnable Attention Mask (LAM), strategically designed to
globally regulate attention maps and prioritize critical tokens within the
sequence. Built on the Self-Attention module of a BERT-like Transformer
network, our approach captures associations between tokens. Extending the LAM
to a multi-layer version accommodates the different aspects of information
captured at each layer of the Transformer network.
Experiments on datasets including MADv2, QVHighlights, ImageNet 1K, and MSRVTT
demonstrate the efficacy of the LAM: it improves model performance while
reducing redundant computation. This approach advances the understanding of
complex scenarios, as in movie understanding.

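By introducing a Learnable Attention Mask (LAM) that globally regulates
attention maps and prioritizes the key tokens in a sequence, the method
captures the associations between tokens in a BERT-like Transformer network.
Extending the LAM to a multi-layer version adapts it to the different
information at each layer of the Transformer network. Experiments show that
the method improves model performance and reduces redundant computation on
various datasets, yielding notable progress in understanding complex
scenarios such as movies.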
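
To make the idea of a learnable mask that regulates the attention map more
concrete, the following is a minimal NumPy sketch, not the authors'
implementation. The text above does not specify how the mask is computed or
combined with the attention scores, so everything here is an assumption: the
names `lam_attention`, `init_layer`, `multi_layer_forward` and `w_mask` are
hypothetical, the mask is assumed to be a linear projection of the tokens, it
is assumed to rescale the scores element-wise before the softmax, and a
single head with a fixed sequence length is used. The multi-layer variant is
modeled simply by giving every layer its own mask parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def init_layer(rng, n_tokens, d_model):
    """Random parameters for one layer (hypothetical layout, not from the paper)."""
    scale = 1.0 / np.sqrt(d_model)
    shapes = {
        "w_q": (d_model, d_model),
        "w_k": (d_model, d_model),
        "w_v": (d_model, d_model),
        # Assumed: maps each token to n_tokens mask weights.
        "w_mask": (d_model, n_tokens),
    }
    return {name: rng.normal(0.0, scale, size=shape) for name, shape in shapes.items()}


def lam_attention(x, p):
    """Single-head self-attention with scores rescaled by a learned mask.

    x has shape (n_tokens, d_model). The element-wise product below is an
    assumption about how the mask regulates the attention map.
    """
    d_model = x.shape[1]
    q, k, v = x @ p["w_q"], x @ p["w_k"], x @ p["w_v"]
    scores = (q @ k.T) / np.sqrt(d_model)  # (n_tokens, n_tokens)
    mask = x @ p["w_mask"]                 # (n_tokens, n_tokens), computed from the tokens
    attn = softmax(scores * mask, axis=-1)
    return attn @ v, attn, mask


def multi_layer_forward(x, layers):
    """Apply each layer with its own mask parameters, using residual connections."""
    attns, masks = [], []
    for p in layers:
        out, attn, mask = lam_attention(x, p)
        x = x + out
        attns.append(attn)
        masks.append(mask)
    return x, attns, masks


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))  # 6 tokens, model width 8
layers = [init_layer(rng, n_tokens=6, d_model=8) for _ in range(3)]
output, attns, masks = multi_layer_forward(tokens, layers)
print(output.shape, len(masks), masks[0].shape)
```

Because each layer holds its own `w_mask`, the masks returned for different
layers generally differ, which is one way to read the claim that a multi-layer
mask accommodates the information present at each layer. The sketch does not
attempt to reproduce the reduction in redundant computation reported above.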