Deep Neural Network architectures with external memory components allow the model to perform inference and capture long term dependencies, by storing information explicitly. In this paper, we generalize Key-Value Memory Networks to a multimodal setting, introducing a novel key-addressing mechanism to deal with sequence-to-sequence models. The advantages of the framework are demonstrated on the task of video captioning, i.e generating natural language descriptions for videos. Conditioning on the previous time-step attention distributions for the key-value memory slots, we introduce a temporal structure in the memory addressing schema. The proposed model naturally decomposes the problem of video captioning into vision and language segments, dealing with them as key-value pairs. More specifically, we learn a semantic embedding (v) corresponding to each frame (k) in the video, thereby creating (k, v) memory slots. This allows us to exploit the temporal dependencies at multiple hierarchies (in the recurrent key-addressing; and in the language decoder). Exploiting this flexibility of the framework, we additionally capture spatial dependencies while mapping from the visual to semantic embedding. Extensive experiments on the Youtube2Text dataset demonstrate usefulness of recurrent key-addressing, while achieving competitive scores on BLEU@4, METEOR metrics against state-of-the-art models.

本文提出了Key-Value Memory Networks应用于多模态设置的方法，以及一种新的键寻址机制，将视频字幕生成问题自然地分解为视觉和语言端，将其作为键-值对处理，并在寻址模式下提出了一种递归关注的方法来捕捉语境信息，通过实验发现，这种方法可以提高BLEU@4，METEOR得分，并实现了与最先进方法竞争性能。

递归内存寻址描述视频