Training large foundation models using self-supervised objectives on
unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a
standard procedure. Unfortunately, the efficacy of this approach is often
constrained by both limited fine-tuning compute and the scarcity of labeled
downstream data. We introduce Multimodal Attention Merging (MAM), an approach
that transfers knowledge directly from the attention matrices of models
trained on high-resource modalities (text and images) to models in
resource-constrained domains (speech and audio), in a zero-shot manner.
MAM reduces the Word Error Rate (WER) of an Automatic Speech Recognition (ASR)
model by up to 6.70% relative, and the classification error of an Audio Event
Classification (AEC) model by 10.63% relative. When some data and compute are
available, we present Learnable-MAM, a data-driven approach to merging
attention matrices, which yields a further 2.90% relative reduction in WER for
ASR and an 18.42% relative reduction in classification error for AEC compared
to fine-tuning.
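
As a rough illustration of what merging attention matrices could look like, the
sketch below interpolates same-shaped attention weights from a source model
into a target model. The abstract does not state the merging rule, so the
linear interpolation, the weight `lam`, and all function and layer names here
are assumptions made for illustration only. The helper `relative_reduction`
shows how a relative error reduction such as those quoted above is computed.

```python
from typing import Dict, List

Matrix = List[List[float]]


def interpolate(source: Matrix, target: Matrix, lam: float) -> Matrix:
    """Return lam * target + (1 - lam) * source, element-wise.

    Assumption: linear interpolation stands in for the merging rule,
    which the abstract does not specify.
    """
    same_rows = len(source) == len(target)
    if not same_rows or any(len(s) != len(t) for s, t in zip(source, target)):
        raise ValueError("source and target must have the same shape")
    return [
        [lam * t + (1.0 - lam) * s for s, t in zip(s_row, t_row)]
        for s_row, t_row in zip(source, target)
    ]


def merge_attention(
    source_layers: Dict[str, Matrix],
    target_layers: Dict[str, Matrix],
    lam: float,
) -> Dict[str, Matrix]:
    """Merge attention weights layer by layer (hypothetical layer names).

    Layers that exist only in the target model are kept unchanged.
    """
    merged: Dict[str, Matrix] = {}
    for name, target_w in target_layers.items():
        if name in source_layers:
            merged[name] = interpolate(source_layers[name], target_w, lam)
        else:
            merged[name] = target_w
    return merged


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction in percent, e.g. for WER."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

With `lam = 1.0` the target model is returned unchanged, and with `lam = 0.0`
its attention weights are replaced entirely by those of the source model. In a
zero-shot setting `lam` would be fixed by hand; a learnable variant would
instead fit it on the available data.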
