Scene graphs are a powerful structured representation of the underlying
content of images, and embeddings derived from them have been shown to be
useful in multiple downstream tasks. In this work, we employ a graph
convolutional network to exploit structure in scene graphs and produce image
embeddings useful for semantic image retrieval. In contrast to the
classification-centric supervision traditionally available for learning image
representations, we address the task of learning from relative similarity
labels in a ranking context. Rooted within the contrastive learning paradigm,
we propose a novel loss function that operates on pairs of similar and
dissimilar images and imposes relative ordering between them in embedding
space. We demonstrate that this Ranking loss, coupled with an intuitive triple
sampling strategy, leads to robust representations that outperform well-known
contrastive losses on the retrieval task. In addition, we provide qualitative
evidence that retrieval results based on structured scene information
capture the global context of the scene, unlike visual similarity
search.
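
To make the idea of imposing a relative ordering in embedding space concrete, here is a minimal sketch of a generic margin-based triplet loss. It is not the ranking loss proposed in the abstract above, whose exact form is not given here; the function names, the Euclidean distance, and the margin value are all assumptions chosen for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss (illustrative, not the paper's loss).

    The loss is zero once the similar image (positive) is closer to the
    anchor than the dissimilar image (negative) by at least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings, used only to exercise the function.
well_ordered = triplet_ranking_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
mis_ordered = triplet_ranking_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

A correctly ordered triple incurs no penalty, while a mis-ordered one is penalized in proportion to how far the ordering is violated.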

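This paper explores using a graph convolutional network to exploit the structure of scene graphs and produce semantic image embeddings useful for retrieval. Image representations are learned from similarity labels, and a new ranking loss function together with a triple sampling strategy is proposed. Experiments show that the method outperforms well-known similarity losses and captures the global context of the scene well.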

Scene Graph Embeddings with Relative Similarity Supervision

Scene Graph Embeddings Using Relative Similarity Supervision

In this paper, we present a method for learning discrete linguistic units by
incorporating vector quantization layers into neural models of visually
grounded speech. We show that our method is capable of capturing both
word-level and sub-word units, depending on how it is configured. What
differentiates this paper from prior work on speech unit learning is the choice
of training objective. Rather than using a reconstruction-based loss, we use a
discriminative, multimodal grounding objective which forces the learned units
to be useful for semantic image retrieval. We evaluate the sub-word units on
the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate
over the top-performing submission, while keeping the bitrate approximately the
same. We also present experiments demonstrating the noise robustness of these
units. Finally, we show that a model with multiple quantizers can
simultaneously learn phone-like detectors at a lower layer and word-like
detectors at a higher layer. We show that these detectors are highly accurate,
discovering 279 words with an F1 score greater than 0.5.
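
As a rough illustration of what a vector quantization layer does at inference time, the sketch below maps each continuous feature vector to its nearest codebook entry, so that a sequence of frames becomes a sequence of discrete unit indices. The codebook values, the squared Euclidean distance, and the function names are assumptions for illustration; training the codebook and the surrounding speech model is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(vector, codebook):
    """Return (index, entry) of the codebook vector nearest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

def quantize_sequence(frames, codebook):
    """Map a sequence of feature frames to a sequence of discrete unit ids."""
    return [quantize(frame, codebook)[0] for frame in frames]

# Hypothetical 2-D codebook with three entries and three feature frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.9, 1.2], [0.1, -0.2], [3.5, 4.2]]
unit_ids = quantize_sequence(frames, codebook)
```

Placing such a layer at different depths of a network is one way to obtain units at different granularities, which is the configuration choice the abstract refers to.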

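This paper presents a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that the method can capture both word-level and sub-word units, and we evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving good results.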

Learning Hierarchical Discrete Linguistic Units from Visually Grounded Speech

Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

Self-supervised learning from multimodal image and text data allows deep
neural networks to learn powerful features without human-annotated
data. Web and social media platforms provide a virtually unlimited amount of
this multimodal data. In this work we propose to exploit this freely available
data to learn a multimodal image and text embedding, aiming to leverage the
semantic knowledge learnt in the text domain and transfer it to a visual model
for semantic image retrieval. We demonstrate that the proposed pipeline can
learn from images with associated text without supervision and analyze the
semantic structure of the learnt joint image and text embedding space. We
perform a thorough analysis and performance comparison of five
state-of-the-art text embeddings on three benchmarks. We show that the
embeddings learnt with Web and social media data achieve performance
competitive with supervised methods on the text-based image retrieval task, and
we clearly outperform the state of the art on the MIRFlickr dataset when
training on the target data. Further, we demonstrate how semantic multimodal image retrieval
can be performed using the learnt embeddings, going beyond classical
instance-level retrieval problems. Finally, we present a new dataset,
InstaCities1M, composed of Instagram images and their associated texts, which
can be used for fair comparison of image-text embeddings.
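
Text-based image retrieval in a joint embedding space can be sketched as a nearest-neighbour search: the query text and every image are mapped to vectors in the same space, and images are ranked by similarity to the query. The sketch below assumes the embeddings have already been computed; the cosine similarity measure, the function names, and the toy vectors are assumptions for illustration, not the pipeline described above.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def retrieve(query_embedding, image_embeddings, top_k=2):
    """Return the ids of the `top_k` images most similar to the query."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in ranked[:top_k]]

# Hypothetical image embeddings and a hypothetical text query embedding.
images = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = retrieve([1.0, 0.1], images, top_k=2)
```

Because text and images share one space, the same ranking routine works for any query that can be embedded, including combinations of words.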

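Leveraging Web and social media data, this paper proposes a self-supervised method for learning a multimodal image and text embedding, which learns powerful features without human annotation and transfers semantic knowledge learned in the text domain to a visual model for semantic image retrieval. The study analyzes five different text embedding methods and shows that embeddings learned from Web and social media data perform comparably to supervised methods and outperform the state of the art when trained on the target data. Finally, it introduces the InstaCities1M dataset and demonstrates how the dataset can be used for semantic multimodal image retrieval.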