Visual Speech Recognition (VSR) is the task of predicting spoken words from
silent lip movements. VSR is regarded as a challenging task because lip
movements alone carry insufficient information. In this paper, we propose an Audio
Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement
the insufficient speech information of visual modality by using audio modality.
Unlike previous methods, the proposed AKVSR 1) utilizes rich audio
knowledge encoded by a large-scale pretrained audio model, 2) stores the
linguistic information of that audio knowledge in a compact audio memory by
discarding non-linguistic information from the audio through quantization, and
3) includes an Audio Bridging Module that finds the best-matched audio features
in the compact audio memory, which makes training possible without audio
inputs once the compact audio memory has been composed. We validate the
effectiveness of the proposed method through extensive experiments and achieve
new state-of-the-art performance on the widely used datasets LRS2 and LRS3.
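
The memory-and-lookup idea described above can be illustrated with a small sketch. The abstract does not say which quantizer or matching rule AKVSR uses, so this example assumes nearest-centroid quantization against a given codebook and dot-product attention for the lookup. The function names (`nearest`, `build_memory`, `bridge`) and the toy vectors are hypothetical, chosen only for illustration.

```python
import math

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def build_memory(audio_feats, codebook):
    """Quantize each audio feature and keep only the distinct codes that occur.

    Assumption: mapping features to a small set of codes stands in for
    'discarding non-linguistic information through quantization'.
    """
    used = sorted({nearest(f, codebook) for f in audio_feats})
    return [codebook[i] for i in used]

def bridge(visual_query, memory):
    """Attend over the memory with a visual query.

    Returns the index of the best-matched slot and the attention-weighted
    blend of all slots. No audio input is needed at this point.
    """
    scores = [sum(q * m for q, m in zip(visual_query, slot)) for slot in memory]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    blended = [sum(w * slot[d] for w, slot in zip(weights, memory))
               for d in range(dim)]
    best = max(range(len(memory)), key=lambda i: weights[i])
    return best, blended

# Toy data: a three-entry codebook and three audio feature vectors.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
audio_feats = [[0.9, 0.1], [0.1, 0.8], [1.1, -0.1]]

memory = build_memory(audio_feats, codebook)   # only two codes are ever used
best, blended = bridge([0.2, 2.0], memory)     # lookup from a visual query alone
```

In this sketch the audio features are needed only while `build_memory` runs; afterwards `bridge` works from visual queries alone, which mirrors the abstract's claim that audio inputs become unnecessary once the memory has been composed.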

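Summary: This work proposes an audio-knowledge-based visual speech recognition framework (AKVSR) that uses the audio modality to supplement the insufficient speech information in the visual modality. It draws on rich audio knowledge encoded by a large-scale pretrained audio model, stores linguistic information in a compact audio memory by discarding non-linguistic information through quantization, and includes an Audio Bridging Module that finds the best-matched audio features in the compact audio memory, so that training needs no audio input once the memory is built. Extensive experiments validate the method, which achieves new state-of-the-art results on the widely used LRS2 and LRS3 datasets.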

AKVSR: Audio-Knowledge-Enhanced Visual Speech Recognition Based on a Compressed Pretrained Model

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

This paper presents MAST, a new model for Multimodal Abstractive Text
Summarization that utilizes information from all three modalities -- text,
audio and video -- in a multimodal video. Prior work on multimodal abstractive
text summarization only utilized information from the text and video
modalities. We examine the usefulness and challenges of deriving information
from the audio modality and present a sequence-to-sequence trimodal
hierarchical attention-based model that overcomes these challenges by letting
the model pay more attention to the text modality. MAST outperforms the current
state-of-the-art model (video-text) by 2.51 points in terms of Content F1 score
and 1.00 points in terms of ROUGE-L score on the How2 dataset for multimodal
language understanding.
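
A two-level attention step over three modalities can be sketched as follows. The abstract does not describe how MAST's layers are wired or how the extra attention to text is achieved, so this example assumes plain dot-product attention at both levels and a hypothetical additive `text_bias` on the text logit. All names here are illustrative assumptions, not the paper's formulation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, states):
    """First level: dot-product attention within one modality."""
    weights = softmax([dot(query, s) for s in states])
    return [sum(w * s[d] for w, s in zip(weights, states))
            for d in range(len(query))]

def hierarchical_attend(query, modalities, text_bias=1.0):
    """Second level: attention over the per-modality context vectors.

    Assumption: text_bias is added to the text logit so that the fused
    context leans toward the text modality.
    """
    contexts = {name: attend(query, states) for name, states in modalities.items()}
    names = list(contexts)
    logits = [dot(query, contexts[n]) + (text_bias if n == "text" else 0.0)
              for n in names]
    weights = softmax(logits)
    fused = [sum(w * contexts[n][d] for w, n in zip(weights, names))
             for d in range(len(query))]
    return dict(zip(names, weights)), fused

# Toy encoder states; identical across modalities so only the bias differs.
states = [[1.0, 0.0], [0.0, 1.0]]
modalities = {"text": states, "audio": states, "video": states}
weights, fused = hierarchical_attend([1.0, 0.0], modalities)
```

Because the three toy modalities carry identical states, any difference in `weights` comes from the bias alone, which makes the preference for text easy to inspect.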

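Summary: This paper presents MAST, a new multimodal abstractive text summarization model that uses information from three modalities: text, audio, and video. MAST addresses the challenges of deriving information from the audio modality by letting the model pay more attention to the text modality. On the How2 dataset for multimodal language understanding, it outperforms the current state-of-the-art video-text model by 2.51 points in Content F1 score and 1.00 points in ROUGE-L score.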

MAST: Multimodal Abstractive Summary Generation with Trimodal Hierarchical Attention

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Dense video captioning is the task of localizing interesting events in an
untrimmed video and producing a textual description (caption) for each localized
event. Most previous work in dense video captioning is based solely on
visual information and completely ignores the audio track. However, audio, and
speech in particular, are vital cues for a human observer in understanding an
environment. In this paper, we present a new dense video captioning approach
that is able to utilize any number of modalities for event description.
Specifically, we show how audio and speech modalities may improve a dense video
captioning model. We apply an automatic speech recognition (ASR) system to obtain
a temporally aligned textual description of the speech (similar to subtitles)
and treat it as a separate input alongside video frames and the corresponding
audio track. We formulate the captioning task as a machine translation problem
and utilize the recently proposed Transformer architecture to convert multi-modal
input data into textual descriptions. We demonstrate the performance of our
model on the ActivityNet Captions dataset. The ablation studies indicate a
considerable contribution from audio and speech components, suggesting that
these modalities contain substantial complementary information to video frames.
Furthermore, we provide an in-depth analysis of the ActivityNet Captions results
by leveraging the category tags obtained from original YouTube videos. Code is
publicly available: github.com/v-iashin/MDVC
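
To make the "temporally aligned" speech input concrete, here is a minimal sketch of pairing subtitle-like ASR segments with a localized event. It is not taken from the MDVC code. It assumes that ASR output arrives as `(start, end, text)` tuples in seconds and that a segment belongs to an event when their time intervals overlap. `speech_for_event` and the sample data are hypothetical.

```python
from typing import List, Tuple

# Assumed ASR output format: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]

def speech_for_event(event: Tuple[float, float], subtitles: List[Segment]) -> str:
    """Collect the text of every subtitle segment that overlaps the event."""
    start, end = event
    # Two intervals overlap when each one starts before the other ends.
    picked = [text for seg_start, seg_end, text in subtitles
              if seg_start < end and seg_end > start]
    return " ".join(picked)

subtitles = [
    (0.0, 2.5, "first we chop"),
    (2.5, 6.0, "the onions"),
    (9.0, 12.0, "then fry them"),
]

# Speech overlapping an event proposal that spans seconds 2.0 to 7.0.
event_speech = speech_for_event((2.0, 7.0), subtitles)
# An event that falls in a silent gap yields an empty string.
silent_speech = speech_for_event((6.0, 9.0), subtitles)
```

The resulting string could then serve as the separate speech input for that event, next to its video frames and audio track.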

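Summary: This paper proposes a new dense video captioning approach that can use any number of modalities to describe events. It applies an automatic speech recognition system to obtain a temporally aligned textual description of the speech, treats that description as a separate input alongside the video frames and the corresponding audio track, formulates captioning as a machine translation problem, and uses the recently proposed Transformer architecture to convert the multi-modal input into textual descriptions. The authors evaluate the model on the ActivityNet Captions dataset and provide an in-depth analysis of the results.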