In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge
and the SUPERB benchmark. Our submissions are based on the recently proposed
FaST-VGS model, which is a Transformer-based model that learns to associate raw
speech waveforms with semantically related images, all without the use of any
transcriptions of the speech. Additionally, we introduce a novel extension of
this model, FaST-VGS+, which is trained in a multi-task fashion with a masked
language modeling objective in addition to the visual grounding objective. On
ZeroSpeech 2021, we show that our models perform competitively on the ABX task,
outperform all other concurrent submissions on the Syntactic and Semantic
tasks, and nearly match the best system on the Lexical task. On the SUPERB
benchmark, we show that our models also achieve strong performance, in some
cases even outperforming the popular wav2vec2.0 model.
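
As a rough illustration of the multi-task training described above, the sketch
below combines a visual grounding loss with a masked language modeling loss.
It rests on assumptions that are not taken from the paper: a generic symmetric
contrastive (InfoNCE-style) loss stands in for the visual grounding objective,
the two terms are combined as a weighted sum, and the names `info_nce`,
`multitask_loss`, and `weight` are hypothetical.

```python
import math


def info_nce(sim):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between speech clip i and image j;
    matched pairs are assumed to lie on the diagonal. This is a generic
    stand-in for a visual grounding objective, not the paper's exact loss.
    """
    n = len(sim)

    def row_loss(m):
        total = 0.0
        for i in range(n):
            mx = max(m[i])  # subtract the row max for numerical stability
            log_z = mx + math.log(sum(math.exp(v - mx) for v in m[i]))
            total += log_z - m[i][i]  # negative log-probability of the match
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the speech-to-image and image-to-speech directions.
    return 0.5 * (row_loss(sim) + row_loss(transposed))


def multitask_loss(grounding_loss, masked_loss, weight=1.0):
    """Weighted sum of the two objectives; `weight` is a hypothetical knob."""
    return grounding_loss + weight * masked_loss
```

In this sketch, a similarity matrix whose largest entries sit on the diagonal
yields a lower grounding loss than one whose matched pairs score poorly, and
the masked language modeling term is simply added on top.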

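This work builds on the recently proposed FaST-VGS model, a Transformer-based
model that learns to associate raw speech waveforms with semantically related
images, and introduces a novel extension, FaST-VGS+, which is trained in a
multi-task fashion with a masked language modeling objective and a visual
grounding objective. Our models perform strongly on the ZeroSpeech 2021
Challenge and the SUPERB benchmark, nearly matching the best system on the
Lexical task.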