The recently proposed visually grounded speech model SpeechCLIP is an
innovative framework that bridges speech and text through images via CLIP
without relying on text transcription. On this basis, this paper introduces two
extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire
(CIF) module to replace a fixed number of CLS tokens in the cascaded
architecture. Second, we propose a new hybrid architecture that merges the
cascaded and parallel architectures of SpeechCLIP into a multi-task learning
framework. Our experimental evaluation is performed on the Flickr8k and
SpokenCOCO datasets. The results show that in the speech keyword extraction
task, the CIF-based cascaded SpeechCLIP model outperforms the previous cascaded
SpeechCLIP model using a fixed number of CLS tokens. Furthermore, through our
hybrid architecture, cascaded task learning boosts the performance of the
parallel branch in image-speech retrieval tasks.
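The integrate-and-fire idea behind the CIF module can be sketched as follows.
This is a minimal illustration of the generic mechanism, not the configuration
used in the paper: the function name, the firing threshold of 1.0, and the
hand-written per-frame weights are all assumptions (in a trained model the
weights would be predicted from the speech frames). Weights are accumulated
frame by frame, and each time the running total reaches the threshold one
segment embedding is emitted, so the number of outputs depends on the input
rather than on a fixed number of CLS tokens.

```python
import numpy as np


def continuous_integrate_and_fire(frames, alphas, threshold=1.0):
    """Integrate weighted frames and emit one embedding per firing.

    frames: (T, D) frame features; alphas: (T,) weights in [0, 1].
    Both the signature and the threshold are illustrative assumptions.
    """
    frames = np.asarray(frames, dtype=float)
    fired = []
    acc_weight = 0.0                       # weight accumulated so far
    acc_state = np.zeros(frames.shape[1])  # weighted sum of frames so far
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_state = acc_state + alpha * frame
        else:
            # Fire: spend only the weight needed to reach the threshold,
            # and carry the remainder of this frame into the next segment.
            used = threshold - acc_weight
            fired.append(acc_state + used * frame)
            acc_weight = alpha - used
            acc_state = acc_weight * frame
    if not fired:
        return np.zeros((0, frames.shape[1]))
    return np.stack(fired)


frames = np.eye(5)                      # five toy one-hot "frames"
alphas = [0.5, 0.5, 0.25, 0.75, 0.5]    # toy weights; the last 0.5 never fires
segments = continuous_integrate_and_fire(frames, alphas)
print(segments.shape)  # (2, 5)
```

In this toy run the leftover weight at the end of the utterance is simply
discarded; how a real system handles that remainder is a design choice the
abstract does not describe.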

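By replacing the fixed number of CLS tokens, the cascaded SpeechCLIP model based
on the Continuous Integrate-and-Fire module outperforms the previous cascaded
SpeechCLIP model on the speech keyword extraction task. In addition, through the
hybrid architecture, cascaded task learning improves the performance of the
parallel branch on image-speech retrieval tasks.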
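One way to picture the multi-task training of the hybrid architecture is as a
weighted sum of one loss per branch. The sketch below assumes that each branch
is trained with a symmetric contrastive objective against image embeddings and
that the two losses are simply added; the summary states neither, and the
function names, the temperature, and the `cascaded_weight` parameter are
hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(speech, image, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    speech = speech / np.linalg.norm(speech, axis=1, keepdims=True)
    image = image / np.linalg.norm(image, axis=1, keepdims=True)
    logits = speech @ image.T / temperature
    idx = np.arange(len(speech))
    speech_to_image = -log_softmax(logits)[idx, idx].mean()
    image_to_speech = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (speech_to_image + image_to_speech)


def hybrid_loss(parallel_speech, cascaded_speech, image, cascaded_weight=1.0):
    """Assumed multi-task objective: parallel loss plus weighted cascaded loss."""
    return (contrastive_loss(parallel_speech, image)
            + cascaded_weight * contrastive_loss(cascaded_speech, image))


image = np.eye(3)                            # toy image embeddings
aligned = hybrid_loss(np.eye(3), np.eye(3), image)
mismatched = hybrid_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0), image)
print(aligned < mismatched)  # True
```

Because both terms share the same speech encoder in a multi-task setup,
gradients from the cascaded term can also shape the representation used by the
parallel branch, which is the effect the summary reports.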

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data

We propose a visually grounded speech model that learns new words and their
visual depictions from just a few word-image example pairs. Given a set of test
images and a spoken query, we ask the model which image depicts the query word.
Previous work has simplified this few-shot learning problem by either using an
artificial setting with digit word-image pairs or by using a large number of
examples per class. Moreover, all previous studies were performed using English
speech-image data. We propose an approach that can work on natural word-image
pairs but with fewer examples, i.e. fewer shots, and then illustrate how this
approach can be applied to multimodal few-shot learning in a real low-resource
language, Yoruba. Our approach involves using the given word-image example
pairs to mine new unsupervised word-image training pairs from large collections
of unlabelled speech and images. Additionally, we use a word-to-image attention
mechanism to determine word-image similarity. With this new model, we achieve
better performance with fewer shots than previous approaches on an existing
English benchmark. Many of the model's mistakes are due to confusion between
visual concepts co-occurring in similar contexts. The experiments on Yoruba
show the benefit of transferring knowledge from a multimodal model trained on a
larger set of English speech-image data.
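The mining step described above can be illustrated with a nearest-neighbour
search. The sketch below is an assumption about how such mining could work, not
the authors' procedure: it supposes that the few given examples and the
unlabelled items have already been mapped to fixed-size, non-zero embeddings,
and it picks the k unlabelled items most similar (by cosine similarity) to any
support example of each word. The function name and the toy data are
hypothetical.

```python
import numpy as np


def mine_pairs(support, unlabelled, k):
    """Return, per word, the indices of the k closest unlabelled items.

    support: dict mapping a word to an (S, D) array of example embeddings.
    unlabelled: (N, D) array of embeddings with no labels.
    """
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    pool = unit(unlabelled)
    mined = {}
    for word, examples in support.items():
        sims = unit(examples) @ pool.T   # (S, N) cosine similarities
        best = sims.max(axis=0)          # best match to any support example
        mined[word] = np.argsort(-best)[:k]
    return mined


support = {"dog": [[1.0, 0.0]], "cat": [[0.0, 1.0]]}   # one shot per word
unlabelled = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2], [0.2, 1.0]]
mined = mine_pairs(support, unlabelled, k=2)
print(mined["dog"].tolist(), mined["cat"].tolist())  # [0, 2] [1, 3]
```

The mined indices would then be treated as additional, noisy training examples
for the corresponding word.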

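This study proposes a visually grounded speech model that learns new words and
their visual depictions from a small number of image and word examples, and
achieves better performance in the low-resource language Yoruba through
multimodal few-shot learning.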

Visually grounded few-shot word learning in low-resource settings

We propose a visually grounded speech model that acquires new words and their
visual depictions from just a few word-image example pairs. Given a set of test
images and a spoken query, we ask the model which image depicts the query word.
Previous work has simplified this problem by either using an artificial setting
with digit word-image pairs or by using a large number of examples per class.
We propose an approach that can work on natural word-image pairs but with fewer
examples, i.e. fewer shots. Our approach involves using the given word-image
example pairs to mine new unsupervised word-image training pairs from large
collections of unlabelled speech and images. Additionally, we use a
word-to-image attention mechanism to determine word-image similarity. With this
new model, we achieve better performance with fewer shots than any existing
approach.
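A word-to-image attention score of the kind mentioned above might look like the
following sketch. The abstract does not give the formulation, so everything
here is an assumption: the spoken word is taken to be a single embedding, each
image a set of patch embeddings, and the similarity is a softmax-weighted
average of word-patch cosine scores. The function names and the temperature are
hypothetical.

```python
import numpy as np


def word_to_image_similarity(word, patches, temperature=0.1):
    """Attend from one word embedding over the patch embeddings of one image."""
    word = np.asarray(word, dtype=float)
    patches = np.asarray(patches, dtype=float)
    word = word / np.linalg.norm(word)
    patches = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    scores = patches @ word                                   # cosine per patch
    weights = np.exp((scores - scores.max()) / temperature)   # stable softmax
    weights /= weights.sum()
    return float(weights @ scores)           # attention-weighted similarity


def pick_image(word, images):
    """Return the index of the test image that best matches the spoken query."""
    sims = [word_to_image_similarity(word, patches) for patches in images]
    return int(np.argmax(sims)), sims


query = [1.0, 0.0]
images = [
    [[0.0, 1.0], [0.0, 1.0]],   # no patch resembles the query
    [[1.0, 0.0], [0.0, 1.0]],   # one patch matches the query
]
best, sims = pick_image(query, images)
print(best)  # 1
```

Pooling over patches in this way lets a single matching region dominate the
score, which suits the task of asking which test image depicts the query word.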

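This paper proposes a model that combines vision and speech to learn new words
and their visual representations from only a few word-image example pairs. Our
approach uses the given word-image example pairs to mine new unsupervised
word-image training pairs from large collections of unlabelled speech and
images, and uses a word-to-image attention mechanism to determine word-image
similarity. The new model performs better than existing approaches while
requiring fewer examples.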