Visually-grounded models of spoken language understanding extract semantic
information directly from speech, without relying on transcriptions. This is
useful for low-resource languages, where transcriptions can be expensive or
impossible to obtain. Recent work has shown that these models can be improved if
transcriptions are available at training time. However, it is not clear how an
end-to-end approach compares to a traditional pipeline-based approach when one
has access to transcriptions. Comparing different strategies, we find that the
pipeline approach works better when enough text is available. With low-resource
languages in mind, we also show that translations can be effectively used in
place of transcriptions, but more data is needed to obtain similar results.

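This paper studies visually grounded models for the semantic understanding of spoken language. With low-resource languages in mind, it compares a traditional pipeline approach with an end-to-end approach as ways to improve model performance, and finds that the pipeline approach works better when enough text is available. Translations can effectively replace transcriptions, but more data is needed to obtain similar results.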