Automatically generating a natural language sentence to describe the content
of an input video is a very challenging problem. It is an inherently multimodal
task in which auditory and visual content are equally important. Although
audio information has been exploited to improve video captioning in previous
works, it is usually regarded as an additional feature fed into a black box
fusion machine. How are the words in the generated sentences associated with
the auditory and visual modalities? This question has not yet been investigated. In
this paper, we make the first attempt to design an interpretable audio-visual
video captioning network to discover the association between words in sentences
and audio-visual sequences. To achieve this, we propose a multimodal
convolutional neural network-based audio-visual video captioning framework and
introduce a modality-aware module for exploring modality selection during
sentence generation. In addition, we collect new audio captioning and visual
captioning datasets to further explore the interactions between the auditory and
visual modalities for high-level video understanding. Extensive experiments
demonstrate that the modality-aware module makes our model interpretable on
modality selection during sentence generation. Even with the added
interpretability, our video captioning network still achieves performance
comparable to recent state-of-the-art methods.
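
To make the idea of modality selection concrete, the sketch below shows one
generic way a per-word gate over two modalities can be read as an explanation.
It is not the network described in the abstract: the softmax gate, the function
names `modality_weights` and `fuse`, and the scalar relevance scores are all
assumptions introduced for illustration only.

```python
import math


def modality_weights(audio_score, visual_score):
    """Softmax over two scalar relevance scores.

    Returns (w_audio, w_visual), which sum to 1. The scores are hypothetical
    inputs; a real model would compute them from its decoder state.
    """
    m = max(audio_score, visual_score)  # subtract the max for numerical stability
    ea = math.exp(audio_score - m)
    ev = math.exp(visual_score - m)
    total = ea + ev
    return ea / total, ev / total


def fuse(audio_feat, visual_feat, audio_score, visual_score):
    """Weighted sum of two equal-length feature vectors.

    Also returns the weights so they can be inspected per generated word.
    """
    w_a, w_v = modality_weights(audio_score, visual_score)
    fused = [w_a * a + w_v * v for a, v in zip(audio_feat, visual_feat)]
    return fused, (w_a, w_v)
```

Under these assumptions, a word such as "barking" would be expected to receive a
larger audio weight than a word such as "red", and the weights returned by
`fuse` are what one would plot to inspect that behavior.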

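This paper presents a multimodal convolutional neural network framework for audio-visual video captioning. A modality-aware module makes modality selection during sentence generation interpretable, and newly collected audio and visual captioning datasets are used to explore audio-visual interactions for video understanding. The interpretable model still achieves performance comparable to recent state-of-the-art methods.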

An Attempt towards Interpretable Audio-Visual Video Captioning

In this paper, we present a system that associates faces with voices in a
video by fusing information from the audio and visual signals. The thesis
underlying our work is that an extremely simple approach to generating (weak)
speech clusters can be combined with visual signals to effectively associate
faces and voices by aggregating statistics across a video. This approach does
not need any training data specific to this task and leverages the natural
coherence of information in the audio and visual streams. It is particularly
applicable to tracking speakers in videos on the web where a priori information
about the environment (e.g., number of speakers, spatial signals for
beamforming) is not available. We performed experiments on a real-world dataset
using this analysis framework to determine the speaker in a video. Given a
ground truth labeling determined by human rater consensus, our approach had
~71% accuracy.
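
The aggregation step can be illustrated with a minimal sketch. It assumes the
video has already been split into segments, each carrying a weak speech-cluster
label and the set of faces judged visually active in that segment. The input
layout and the function name `associate_faces_with_voices` are hypothetical and
are not taken from the paper, which does not specify its statistics here.

```python
from collections import Counter, defaultdict


def associate_faces_with_voices(segments):
    """Map each speech cluster to the face it co-occurs with most often.

    segments: iterable of (speech_cluster_id, active_face_ids) pairs, one per
    time segment of the video (an assumed input format).
    """
    counts = defaultdict(Counter)
    for cluster_id, face_ids in segments:
        for face_id in face_ids:
            # Accumulate co-occurrence statistics across the whole video.
            counts[cluster_id][face_id] += 1
    # Pick the most frequently co-occurring face for every cluster.
    return {
        cluster: faces.most_common(1)[0][0]
        for cluster, faces in counts.items()
    }


segments = [
    ("c0", ["alice", "bob"]),
    ("c0", ["alice"]),
    ("c1", ["bob"]),
    ("c1", ["alice", "bob"]),
    ("c0", ["alice", "bob"]),
]
mapping = associate_faces_with_voices(segments)
```

Because the counts are pooled over the whole video, a few noisy segments in
which the wrong face is visible do not change the final assignment, which is the
intuition behind combining weak clusters with visual evidence.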

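This paper presents an audio-visual association system that fuses information from the audio and visual signals and associates faces with voices by aggregating statistics across a video. It needs no task-specific training data, leverages the natural coherence of information in the audio and visual streams, and is particularly suited to tracking speakers in web videos. Experiments on a real-world dataset show an accuracy of about 71%.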