Imagine being able to show a system a visual depiction of a keyword and
have it find spoken utterances that contain this keyword in a zero-resource speech
corpus. We formalise this task and call it visually prompted keyword
localisation (VPKL): given an image of a keyword, detect and predict where in
an utterance the keyword occurs. To do VPKL, we propose a speech-vision model
with a novel localising attention mechanism which we train with a new keyword
sampling scheme. We show that these innovations give improvements in VPKL over
an existing speech-vision model. We also compare to a visual bag-of-words (BoW)
model where images are automatically tagged with visual labels and paired with
unlabelled speech. Although this visual BoW can be queried directly with a
written keyword (while ours takes image queries), our new model still
outperforms the visual BoW in both detection and localisation, giving a 16%
relative improvement in localisation F1.
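
To make the two parts of the task concrete, the following is a minimal sketch of detection and localisation with an image query. It is not the proposed model or its localising attention mechanism: it assumes, purely for illustration, that the image and each speech frame have already been embedded in a shared space, and it scores frames by cosine similarity. The function name `vpkl_sketch`, the threshold value, and the toy embeddings are all hypothetical.

```python
import numpy as np


def vpkl_sketch(image_query, speech_frames, threshold=0.5):
    """Illustrative only: score each speech frame against an image query.

    image_query:   (dim,) embedding of the keyword image (assumed given).
    speech_frames: (num_frames, dim) per-frame speech embeddings (assumed given).
    threshold:     hypothetical detection threshold on cosine similarity.
    """
    # Normalise both sides so the dot product is a cosine similarity.
    q = image_query / np.linalg.norm(image_query)
    f = speech_frames / np.linalg.norm(speech_frames, axis=1, keepdims=True)
    scores = f @ q  # shape: (num_frames,)

    # Detection: does any frame match the query strongly enough?
    detected = bool(scores.max() >= threshold)
    # Localisation: the best-matching frame, reported only if detected.
    location = int(scores.argmax()) if detected else None
    return detected, location, scores


# Toy example: the query matches the third frame (index 2) exactly.
query = np.array([0.0, 1.0, 0.0, 0.0])
frames = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
detected, location, scores = vpkl_sketch(query, frames)
```

In this toy setting, detection is a decision about the whole utterance, while localisation picks a position within it; the two are evaluated separately in the results above.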

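This paper proposes the task of visually prompted keyword localisation (VPKL): a speech-vision model with a novel localising attention mechanism, trained with a new keyword sampling scheme, detects and localises a keyword in the input utterance. Compared with detection and localisation using a visual bag-of-words (visual BoW) model, the VPKL model improves both keyword detection and localisation, with a 16% relative improvement in localisation F1 over the BoW model.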