Audio-visual recognition (AVR) has been considered as a solution for speech
recognition tasks when the audio is corrupted, as well as a visual recognition
method used for speaker verification in multi-speaker scenarios. The approach
of AVR systems is to leverage the extracted information from one modality to
improve the recognition ability of the other modality by complementing the
missing information. The essential problem is to find the correspondence
between the audio and visual streams, which is the goal of this work. We
propose the use of a coupled 3D Convolutional Neural Network (3D-CNN)
architecture that can map both modalities into a representation space to
evaluate the correspondence of audio-visual streams using the learned
multimodal features. The proposed architecture incorporates spatial and
temporal information jointly to effectively find the correlation between the
temporal information of the different modalities. With a relatively small
network architecture and a much smaller training dataset, our proposed method
outperforms existing methods for audio-visual matching that use 3D CNNs for
feature representation. We also demonstrate that an effective pair selection
method can significantly improve performance.
The proposed method achieves relative improvements of over 20% in Equal Error
Rate (EER) and over 7% in Average Precision (AP) compared with the
state-of-the-art method.
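
For readers unfamiliar with the metrics above, the following is a minimal
sketch of how an EER and a relative improvement could be computed from
matching scores. It is not the evaluation code used in this work: the function
names, the threshold sweep, and every number in the usage note are
illustrative assumptions.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely a genuine audio-visual pair" (assumed).
    labels: 1 for genuine pairs, 0 for impostor pairs.
    Both classes must be present in labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # impostor pairs wrongly accepted
        frr = np.mean(~accept[labels])   # genuine pairs wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

def relative_improvement(baseline, proposed, higher_is_better=False):
    """Relative change with respect to the baseline value.

    Use higher_is_better=False for error metrics such as EER and
    higher_is_better=True for metrics such as AP.
    """
    delta = (proposed - baseline) if higher_is_better else (baseline - proposed)
    return delta / baseline
```

As a usage example with made-up values, `relative_improvement(0.10, 0.08)`
describes an EER falling from 10% to 8%, which is a relative improvement of
about 20% even though the absolute change is only two percentage points.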

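This paper proposes an AVR method that uses a coupled 3D convolutional neural
network architecture to map audio and video streams into a unified
representation space, and thereby effectively finds the correlation between the
temporal information of the different modalities. Compared with existing
audio-visual matching methods that use 3D CNN feature representations, our
method is trained with a smaller network architecture and dataset yet improves
performance significantly: relative to the state-of-the-art method, the
relative improvement exceeds 20% in Equal Error Rate (EER) and 7% in Average
Precision (AP).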