We introduce the task of spatially localizing narrated interactions in
videos. Key to our approach is the ability to learn to spatially localize
interactions with self-supervision on a large corpus of videos with
accompanying transcribed narrations. To achieve this goal, we propose a
multilayer cross-modal attention network that enables effective optimization of
a contrastive loss during training. We introduce a divided strategy that
alternates between computing inter- and intra-modal attention across the visual
and natural language modalities, which allows effective training via directly
contrasting the two modalities' representations. We demonstrate the
effectiveness of our approach by self-training on the HowTo100M instructional
video dataset and evaluating on a newly collected dataset of localized,
described interactions in YouCook2. We show that our approach
outperforms alternative baselines, including shallow co-attention and full
cross-modal attention. We also apply our approach to grounding phrases in
images with weak supervision on Flickr30K and show that stacking multiple
attention layers is effective and, when combined with a word-to-region loss,
achieves state-of-the-art recall-at-one and pointing-hand accuracy.
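
As an illustration of the divided strategy described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes single-head dot-product attention with residual connections and no learned projections, mean pooling of each modality, cosine similarity as the pair score, and a symmetric InfoNCE-style contrastive loss. The function names (`attend`, `divided_attention`, `pair_score`, `contrastive_loss`), the layer count, and the temperature are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    # Scaled dot-product attention plus a residual connection.
    # Assumption: no learned query/key/value projections, for brevity.
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ context.T / scale)
    return queries + weights @ context

def divided_attention(regions, words, num_layers=2):
    # regions: (num_regions, dim) visual features; words: (num_words, dim).
    v, w = regions, words
    for _ in range(num_layers):
        # Inter-modal step: each modality attends to the other one.
        v, w = attend(v, w), attend(w, v)
        # Intra-modal step: each modality attends to itself.
        v, w = attend(v, v), attend(w, w)
    return v, w

def pair_score(regions, words, num_layers=2):
    # Cosine similarity between the mean-pooled representations.
    v, w = divided_attention(regions, words, num_layers)
    v_vec, w_vec = v.mean(axis=0), w.mean(axis=0)
    denom = np.linalg.norm(v_vec) * np.linalg.norm(w_vec)
    return float(v_vec @ w_vec / denom)

def contrastive_loss(batch_regions, batch_words, temperature=0.1):
    # Score every (video, narration) pair; matching pairs lie on the diagonal.
    n = len(batch_regions)
    scores = np.array([[pair_score(batch_regions[i], batch_words[j])
                        for j in range(n)] for i in range(n)]) / temperature
    idx = np.arange(n)
    video_to_text = np.log(softmax(scores, axis=1))[idx, idx]
    text_to_video = np.log(softmax(scores, axis=0))[idx, idx]
    return float(-(video_to_text.mean() + text_to_video.mean()) / 2)
```

In this sketch the pair representations depend on both inputs, so every video is scored against every narration in the batch; samples may have different numbers of regions and words as long as the feature dimension matches.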

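This paper introduces the task of localizing narrated interactions in videos frame by frame. It achieves self-supervised learning with a multilayer cross-modal attention network that alternates between computing inter- and intra-modal attention across the visual and natural language modalities for effective training, and it outperforms baseline models including shallow co-attention and full cross-modal attention.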