Audio-visual speech recognition (AVSR) research has recently achieved great
success by improving the noise robustness of audio-only automatic speech
recognition (ASR) with noise-invariant visual information. However, most
existing AVSR approaches simply fuse the audio and visual features by
concatenation, without explicit interactions to capture the deep correlations
between them, which results in sub-optimal multimodal representations for
the downstream speech recognition task. In this paper, we propose a cross-modal
global interaction and local alignment (GILA) approach for AVSR, which captures
the deep audio-visual (A-V) correlations from both global and local
perspectives. Specifically, we design a global interaction model to capture the
A-V complementary relationship at the modality level, as well as a local
alignment approach to model the A-V temporal consistency at the frame level.
Such a holistic view of cross-modal correlations enables better multimodal
representations for AVSR. Experiments on the public benchmarks LRS3 and LRS2
show that our GILA outperforms the supervised learning state of the art.
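
To make the frame-level alignment idea concrete, here is a minimal sketch of one way temporal consistency could be encouraged. It assumes an InfoNCE-style contrastive objective in which the audio frame at time t should be most similar to the visual frame at the same time t. The abstract does not specify the loss, so the function name `frame_alignment_loss`, the cosine similarity, and the temperature value are all illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def frame_alignment_loss(audio, video, temperature=0.1):
    """InfoNCE-style loss (assumed, not from the abstract).

    For each time step t, the audio frame is the anchor, the video frame at
    the same t is the positive, and all other video frames are negatives.
    """
    assert len(audio) == len(video)
    total = 0.0
    for t, a in enumerate(audio):
        logits = [cosine(a, v) / temperature for v in video]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[t]
    return total / len(audio)


# Toy features: three frames, two dimensions per frame (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shifted = aligned[1:] + aligned[:1]  # same frames, offset by one time step

loss_aligned = frame_alignment_loss(audio, aligned)
loss_shifted = frame_alignment_loss(audio, shifted)
```

In this toy setup the temporally aligned pair yields a lower loss than the shifted pair, which is the behaviour a frame-level consistency objective is meant to reward.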

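This paper proposes a cross-modal global interaction and local alignment (GILA) approach that captures the deep audio-visual (A-V) correlations from both global and local perspectives to improve the multimodal representations in audio-visual speech recognition. Experimental results show that the method outperforms existing supervised learning methods.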

Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition

The favorable performance of Vision Transformers (ViTs) is often attributed
to multi-head self-attention (MSA). MSA enables global interactions at each
layer of a ViT model, in contrast to Convolutional Neural Networks (CNNs),
which gradually increase the range of interaction across multiple layers.
We study the role of the density of the attention. Our
preliminary analyses suggest that the spatial interactions of attention maps
are close to dense interactions rather than sparse ones. This is a curious
phenomenon, as dense attention maps are harder for the model to learn due to
steeper softmax gradients around them. We interpret this as a strong preference
for ViT models to include dense interaction. We thus manually insert
uniform attention into each layer of ViT models to supply the much-needed dense
interactions. We call this method Context Broadcasting (CB). We observe that the
inclusion of CB reduces the degree of density in the original attention maps
and increases both the capacity and generalizability of the ViT models. CB
incurs negligible costs: 1 line in your model code, no additional parameters,
and minimal extra operations.
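
Uniform attention gives every token the same weight, so its output is simply the mean of all tokens. The sketch below illustrates that idea by adding the mean token back to each token. The abstract does not state the exact formula, scaling, or position inside the layer, so the function `context_broadcast` and the plain addition are assumptions for illustration only.

```python
def context_broadcast(tokens):
    """Add the mean token to every token (assumed form of the idea).

    With uniform attention weights 1/N over N tokens, the attended output is
    the token mean, so this supplies one dense interaction with no parameters.
    """
    num_tokens = len(tokens)
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / num_tokens for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in tokens]


# Three tokens with two features each (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = context_broadcast(tokens)

# With a tensor library this is roughly one line; for example, for a PyTorch
# tensor x of shape (batch, tokens, dim):
#     x = x + x.mean(dim=1, keepdim=True)
```

Here the mean token is `[3.0, 4.0]`, so every token is shifted by the same global context vector, and no learnable parameters are involved.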

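By studying the density of the self-attention mechanism in Vision Transformers, this work finds that dense interactions are important to the model and proposes a new method, Context Broadcasting (CB), which effectively improves the capacity and generalizability of the model.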

Scratching Visual Transformer's Back with Uniform Attention

We present Mobile-Former, a parallel design of MobileNet and transformer with
a two-way bridge in between. This structure leverages the advantages of
MobileNet at local processing and of the transformer at global interaction, and
the bridge enables bidirectional fusion of local and global features. Unlike
recent work on vision transformers, the transformer in Mobile-Former contains
very few tokens (e.g., 6 or fewer) that are randomly initialized to learn
global priors, resulting in low computational cost. Combined with the proposed
light-weight cross attention that models the bridge, Mobile-Former is not only
computationally efficient but also has more representation power. It
outperforms MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on
ImageNet classification. For instance, Mobile-Former achieves 77.9% top-1
accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the
computation. When transferred to object detection, Mobile-Former outperforms
MobileNetV3 by 8.6 AP in the RetinaNet framework. Furthermore, we build an
efficient end-to-end detector by replacing the backbone, encoder, and decoder in
DETR with Mobile-Former, which outperforms DETR by 1.1 AP while saving 52% of
the computational cost and 36% of the parameters.
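
The low cost of using only a handful of global tokens can be illustrated with a minimal cross-attention sketch covering one direction of the bridge (local features to global tokens). The global tokens act as queries over all local feature positions. Learned projections, multiple heads, and the reverse direction are omitted, and the sizes (`dim`, an 8x8 feature map) are illustrative assumptions rather than the paper's configuration.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query token attends over every local feature vector.

    Simplification (assumed): keys and values are the raw features, with no
    learned projection matrices.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        outputs.append([
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


rng = random.Random(0)
dim, num_tokens, num_positions = 8, 6, 64  # 64 = 8x8 feature map (illustrative)

# Randomly initialized global tokens and stand-in local features.
global_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(num_tokens)]
local_features = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_positions)]

updated_tokens = cross_attention(global_tokens, local_features)

# Number of query-key pairs scored: few tokens vs. full self-attention.
token_pairs = num_tokens * num_positions      # 6 * 64
dense_pairs = num_positions * num_positions   # 64 * 64
```

With 6 tokens the attention scores 384 query-key pairs, compared with 4096 for self-attention over all 64 positions, which gives a rough sense of why so few tokens keep the global branch cheap.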

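Mobile-Former is a design that combines MobileNet and a transformer through a two-way bridge. It has low computational cost and stronger representation power, can be used for image classification and object detection, and outperforms MobileNetV3 in the low-FLOP regime as well as the conventional object detection framework DETR.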