In this paper, we explore neural network models that learn to associate
segments of spoken audio captions with the semantically relevant portions of
natural images that they refer to. We demonstrate that these audio-visual
associative localizations emerge from network-internal representations learned
as a by-product of training to perform an image-audio retrieval task. Our
models operate directly on the image pixels and speech waveform, and do not
rely on any conventional supervision in the form of labels, segmentations, or
alignments between the modalities during training. Our analysis on the
Places 205 and ADE20k datasets demonstrates that our models implicitly
learn semantically coupled object and word detectors.

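Summary: This paper addresses the semantic associations between speech and images, exploring neural network models that do not require conventional supervision. The models are validated on the Places 205 and ADE20k datasets. Without relying on labels, segmentations, or alignments between the modalities, they support speech-image retrieval, fine-grained localization, and the implicit detection of objects and words in space and time.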