With the recent advancements in AI, Intelligent Virtual Assistants (IVAs) have
become a ubiquitous part of every home. Going forward, we are witnessing a
confluence of vision, speech and dialog system technologies that are enabling
the IVAs to learn audio-visual groundings of utterances and have conversations
with users about the objects, activities and events surrounding them. As part
of the 7th Dialog System Technology Challenges (DSTC7), for the Audio Visual
Scene-Aware Dialog (AVSD) track, we explore adding `topics' of the dialog as an
important contextual feature to the architecture, along with explorations
around multimodal attention. We also incorporate an end-to-end audio
classification ConvNet, AclNet, into our models. We present a detailed analysis
of the experiments and show that some of our model variations outperform the
baseline system presented for this task.

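By combining multimodal attention mechanisms with an end-to-end audio classification convolutional neural network, this work enables Intelligent Virtual Assistants (IVAs) to understand speech and visual scenes and to converse naturally, surpassing the performance of the baseline system.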

Exploration of Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog

Audio-visual speech recognition (AVSR) is thought to be one of the
most promising solutions for robust speech recognition, especially in noisy
environments. In this paper, we propose a novel multimodal attention-based
method for audio-visual speech recognition that automatically learns a
fused representation from both modalities based on their importance. Our method
is realized using state-of-the-art sequence-to-sequence (Seq2seq)
architectures. Experimental results show that relative improvements of 2% to
36% over the auditory modality alone are obtained, depending on the
signal-to-noise ratio (SNR). Compared to traditional feature concatenation
methods, our proposed approach achieves better recognition performance under
both clean and noisy conditions. We believe the modality-attention-based
end-to-end method can be easily generalized to other multimodal tasks with
correlated information.

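This study proposes a multimodal attention-based method for audio-visual speech recognition. Built on state-of-the-art Seq2seq architectures, it automatically learns a fused representation of the two modalities based on their importance and obtains relative improvements of 2% to 36% over the audio modality alone across different signal-to-noise ratios. Compared with traditional feature concatenation methods, it achieves better recognition performance under both clean and noisy conditions, and it can be easily generalized to other multimodal tasks.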

Modality Attention for End-to-End Audio-Visual Speech Recognition

Modality Attention for End-to-End Audio-visual Speech Recognition
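
The contrast drawn in the abstract above, between attention-weighted fusion
and plain feature concatenation, can be illustrated with a minimal NumPy
sketch. It is not the authors' implementation: the function names, the scoring
vector `w`, the bias `b`, and the tanh scoring function are all assumptions made
for illustration, and both feature streams are assumed to be already projected
to a shared dimension D.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def modality_attention_fuse(audio, visual, w, b=0.0):
    """Fuse per-frame audio and visual features with attention weights.

    audio, visual: arrays of shape (T, D), assumed to share the size D.
    w (shape (D,)) and b (scalar) are hypothetical learned parameters.
    Returns (fused, weights) with shapes (T, D) and (T, 2).
    """
    stacked = np.stack([audio, visual], axis=1)            # (T, 2, D)
    scores = np.tanh(stacked) @ w + b                      # (T, 2)
    weights = softmax(scores, axis=1)                      # one weight per modality
    fused = np.sum(weights[..., None] * stacked, axis=1)   # (T, D)
    return fused, weights

def concat_fuse(audio, visual):
    # Baseline: concatenate features, treating both modalities equally.
    return np.concatenate([audio, visual], axis=1)         # (T, 2D)

rng = np.random.default_rng(0)
T, D = 5, 8
audio = rng.normal(size=(T, D))
visual = rng.normal(size=(T, D))
w = rng.normal(size=D)
fused, weights = modality_attention_fuse(audio, visual, w)
```

In this sketch the weights sum to one at every frame, so a model could in
principle shift weight toward the visual stream when the audio is noisy, which
concatenation cannot express directly.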

The attention mechanism is an important part of neural machine
translation (NMT), where it was reported to produce richer source representations
compared to fixed-length encoding sequence-to-sequence models. Recently, the
effectiveness of attention has also been explored in the context of image
captioning. In this work, we assess the feasibility of a multimodal attention
mechanism that simultaneously focuses on an image and its natural language
description for generating a description in another language. We train several
variants of our proposed attention mechanism on the Multi30k multilingual image
captioning dataset. We show that a dedicated attention for each modality
achieves gains of up to 1.6 points in BLEU and METEOR over a textual NMT
baseline.

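This paper applies a multimodal attention mechanism to image captioning: by attending simultaneously to the natural language description and the image, it generates a description in another language from the image caption, and it achieves better results on the Multi30k dataset.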
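
The idea of a dedicated attention for each modality can be sketched as follows.
This is only an illustration under stated assumptions, not the paper's
architecture: dot-product scoring, concatenation of the two context vectors, and
all names (`attend`, `multimodal_context`, `decoder_state`, `text_states`,
`image_regions`) are hypothetical, and all vectors are assumed to share one
dimension.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention over one modality.

    query: (D,) decoder state; keys: (N, D) encoder states or image regions.
    Returns (context, weights) with shapes (D,) and (N,).
    """
    weights = softmax(keys @ query)
    context = weights @ keys
    return context, weights

def multimodal_context(decoder_state, text_states, image_regions):
    # Each modality gets its own attention distribution; the two context
    # vectors are then concatenated (an assumed fusion choice).
    text_ctx, text_w = attend(decoder_state, text_states)
    img_ctx, img_w = attend(decoder_state, image_regions)
    return np.concatenate([text_ctx, img_ctx]), text_w, img_w

rng = np.random.default_rng(0)
D = 16
decoder_state = rng.normal(size=D)
text_states = rng.normal(size=(7, D))     # 7 source-sentence tokens
image_regions = rng.normal(size=(9, D))   # 9 image regions
ctx, text_w, img_w = multimodal_context(decoder_state, text_states, image_regions)
```

Keeping the two attention distributions separate lets the decoder weight source
words and image regions independently at each step, rather than forcing them to
compete inside a single softmax.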