October 2021

Visual Keyword Spotting with Attention

K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman

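TL;DR: This work proposes the Transpotter model, which uses full cross-modal attention between the visual and speech streams to spot spoken keywords in silent video sequences. Across several benchmarks it outperforms current visual keyword spotting and lip reading models, and it shows a strong ability to spot isolated mouthed words.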

Abstract

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate transformer-based models that ingest two streams, a