Given a video and a description sentence with one missing word (we call it the "source sentence"), Video-Fill-In-the-Blank (VFIB) problem is to find the missing word automatically. The contextual information of the sentence, as well as visual cues from the video, are important to infer the missing word accurately. Since the source sentence is broken into two fragments: the sentence's left fragment (before the blank) and the sentence's right fragment (after the blank), traditional Recurrent Neural Networks cannot encode this structure accurately because of many possible variations of the missing word in terms of the location and type of the word in the source sentence. For example, a missing word can be the first word or be in the middle of the sentence and it can be a verb or an adjective. In this paper, we propose a framework to tackle the textual encoding: Two separate LSTMs (the LR and RL LSTMs) are employed to encode the left and right sentence fragments and a novel structure is introduced to combine each fragment with an "external memory" corresponding the opposite fragments. For the visual encoding, end-to-end spatial and temporal attention models are employed to select discriminative visual representations to find the missing word. In the experiments, we demonstrate the superior performance of the proposed method on challenging VFIB problem. Furthermore, we introduce an extended and more generalized version of VFIB, which is not limited to a single blank. Our experiments indicate the generalization capability of our method in dealing with such more realistic scenarios.

本文提出了一种框架来解决视频中的句子填空问题，该框架使用两个分开的LSTM来编码左右句子片段，引入了一个新的结构，将每个片段与相反的片段对应的外部记忆组合起来，并使用端到端的空间和时间注意模型选择区分性视觉表示来找到缺失的单词，实验证明了所提出的方法在具有挑战性的VFIB问题上的卓越性能。

使用带有空间-时间注意力的LR/RL LSTM进行视频填空