Machines that can represent and describe environmental soundscapes have
practical potential, e.g., for audio tagging and captioning systems. Prevailing
learning paradigms rely on parallel audio-text data, which is scarce on the
web. We propose VIP-ANT, which induces
\textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text
data. Our key idea is to share the image modality between bi-modal image-text
representations and bi-modal image-audio representations; the image modality
functions as a pivot and connects audio and text in a tri-modal embedding space
implicitly.
In a difficult zero-shot setting with no paired audio-text data, our model
demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio
classification tasks, and even surpasses the supervised state of the art for
Clotho caption retrieval (with audio queries) by 2.2\% R@1. We further
investigate cases of minimal audio-text supervision, finding that, e.g., just a
few hundred supervised audio-text pairs increase the zero-shot audio
classification accuracy by 8\% on US8K. However, to reach human parity on some
zero-shot tasks, our empirical scaling experiments suggest that we would need
about $2^{21} \approx 2M$ supervised audio-caption pairs. Our work opens up new
avenues for learning audio-text connections with little to no parallel
audio-text data.
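
To make the pivoting idea concrete, the following is a minimal sketch of zero-shot audio classification in a shared embedding space, assuming an audio encoder and a text encoder whose outputs are already comparable because both were aligned to the same image representation. The function names and the 3-dimensional toy vectors are hypothetical placeholders, not the VIP-ANT implementation; real embeddings would come from trained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(audio_embedding, class_text_embeddings):
    """Pick the class whose text embedding is closest to the audio embedding.

    No audio-text pairs are used here: the comparison is only meaningful
    under the assumption that both encoders share one embedding space.
    """
    scores = {
        label: cosine(audio_embedding, emb)
        for label, emb in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up toy embeddings standing in for encoder outputs (assumption).
class_text_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.2, 0.9],
    "rain": [0.1, 0.9, 0.1],
}
audio_embedding = [0.8, 0.2, 0.1]

label, scores = zero_shot_classify(audio_embedding, class_text_embeddings)
print(label)
```

In this sketch the class names play the role of text prompts, and classification reduces to a nearest-neighbor lookup over their embeddings.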

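We propose VIP-ANT, a model that aligns audio and text automatically without any parallel audio-text data. It performs well on zero-shot audio classification and caption retrieval, even surpassing more conventional supervised models. We also find that, although a small amount of supervised data already improves performance, reaching human-level performance still requires much larger-scale data.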

Connecting audio and text without parallel data via visual knowledge transfer

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Visual Question Answering (VQA) is the task of taking as input an image and a
free-form natural language question about the image, and producing an accurate
answer. In this work we view VQA as a "feature extraction" module to extract
image and caption representations. We employ these representations for the task
of image-caption ranking. Each feature dimension captures (imagines) whether a
fact (question-answer pair) could plausibly be true for the image and caption.
This allows the model to interpret images and captions from a wide variety of
perspectives. We propose score-level and representation-level fusion models to
incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic
image-caption ranking model. We find that incorporating and reasoning about
consistency between images and captions significantly improves performance.
Concretely, our model improves the state of the art on caption retrieval by 7.1%
and on image retrieval by 4.4% on the MSCOCO dataset.
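
The difference between the two fusion strategies named above can be illustrated with a small sketch. It assumes dot-product scoring, a fixed blending weight `alpha`, and plain concatenation for combining features; all of these, along with the function names and toy vectors, are illustrative assumptions rather than the models described in the paper.

```python
def dot(a, b):
    """Dot product used here as a stand-in image-caption compatibility score."""
    return sum(x * y for x, y in zip(a, b))


def score_level_fusion(base_score, vqa_score, alpha=0.5):
    """Blend two scores that were computed independently of each other."""
    return alpha * base_score + (1.0 - alpha) * vqa_score


def representation_level_fusion(base_img, vqa_img, base_cap, vqa_cap):
    """Concatenate features per modality first, then score the fused vectors once.

    With a plain dot product this equals the sum of the two separate scores;
    a learned projection over the concatenated features would make it differ.
    """
    fused_img = base_img + vqa_img  # list concatenation
    fused_cap = base_cap + vqa_cap
    return dot(fused_img, fused_cap)


# Made-up toy features: VQA-agnostic ("base") and VQA-derived ("vqa").
base_img, base_cap = [1.0, 0.0], [0.5, 0.5]
vqa_img, vqa_cap = [0.2, 0.8], [0.0, 1.0]

s_score = score_level_fusion(dot(base_img, base_cap), dot(vqa_img, vqa_cap))
s_repr = representation_level_fusion(base_img, vqa_img, base_cap, vqa_cap)
print(s_score, s_repr)
```

Score-level fusion leaves the existing ranking model untouched and only combines its output with the VQA-based score, whereas representation-level fusion changes what the scoring function sees as input.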

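This study treats visual question answering as a "feature extraction" module that extracts image and caption representations, uses them to rank image-caption pairs, and proposes fusion models that improve performance by reasoning about image-caption consistency. Experiments show that the model improves caption retrieval by 7.1% and image retrieval by 4.4% on the MSCOCO dataset.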