``Learning to hash'' is a practical solution for efficient retrieval,
offering fast search speed and low storage cost. It is widely used in
applications such as image-text cross-modal search. In this paper, we explore
how the growing availability of powerful large pre-trained models, such as
Vision-Language Pre-training (VLP) models, can enhance the performance of
learning to hash. We introduce a novel method named Distillation for
Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of
VLP models to improve hash representation learning. Specifically, we use the
VLP model as a `teacher' to distill knowledge into a `student' hashing model
equipped with codebooks. This process replaces the supervised labels, which are
multi-hot vectors that lack semantics, with the rich semantics of the VLP
model. We then apply a transformation termed Normalization with Paired
Consistency (NPC) to obtain a discriminative target for distillation. Further,
we introduce a new quantization method, Product Quantization with Gumbel (PQG),
which promotes balanced codebook learning and thereby improves retrieval
performance. Extensive benchmark testing
demonstrates that DCMQ consistently outperforms existing supervised cross-modal
hashing approaches, showcasing its significant potential.
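
As a rough illustration of how product quantization can be combined with Gumbel noise, the sketch below splits a vector into sub-vectors and replaces each one with a Gumbel-softmax-weighted mix of the codewords in its codebook. This is a generic relaxation written for illustration, not the PQG formulation of the paper: the function names, the negative squared distance used as logits, the temperature `tau`, and the toy codebooks are all assumptions. The added noise lets codewords other than the nearest one receive weight occasionally, which is one plausible way to encourage more balanced codebook usage.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp the uniform sample so both log() calls stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((x - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_product_quantize(vec, codebooks, tau=1.0, seed=0):
    """Quantize each sub-vector as a Gumbel-softmax mix of its codewords.

    Assumes len(vec) is divisible by len(codebooks) and that every codeword
    has the same length as a sub-vector (hypothetical layout).
    """
    rng = random.Random(seed)
    sub_dim = len(vec) // len(codebooks)
    soft, codes = [], []
    for m, book in enumerate(codebooks):
        sub = vec[m * sub_dim:(m + 1) * sub_dim]
        # Logits: negative squared distance from the sub-vector to each codeword.
        logits = [-sum((s - c) ** 2 for s, c in zip(sub, word)) for word in book]
        weights = gumbel_softmax(logits, tau, rng)
        # Soft reconstruction: convex combination of the codewords.
        for d in range(sub_dim):
            soft.append(sum(w * word[d] for w, word in zip(weights, book)))
        # Hard code kept for storage: index of the nearest codeword (no noise).
        codes.append(max(range(len(book)), key=lambda k: logits[k]))
    return soft, codes


# Toy usage: a 4-d vector, two codebooks with two 2-d codewords each.
toy_books = [[[0.0, 0.0], [5.0, 5.0]], [[1.0, 1.0], [-5.0, -5.0]]]
soft, codes = soft_product_quantize([0.0, 0.0, 1.0, 1.0], toy_books)
```

In a trainable model the soft reconstruction would carry gradients to the codebooks, while the hard indices in `codes` are what a retrieval system would store; lowering `tau` moves the soft output closer to the hard assignment.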

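Learning to hash built on large-scale pre-trained models improves cross-modal retrieval performance. The paper introduces a new method named DCMQ, which uses the semantic knowledge of VLP models to improve hash representation learning, and further improves retrieval performance by introducing the PQG quantization method and the NPC transformation.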

Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval

Scene Designer is a novel method for searching and generating images using
free-hand sketches of scene compositions; i.e. drawings that describe both the
appearance and relative positions of objects. Our core contribution is a single
unified model to learn both a cross-modal search embedding for matching
sketched compositions to images, and an object embedding for layout synthesis.
We show that a graph neural network (GNN) followed by a Transformer under our
novel contrastive learning setting is required to allow learning correlations
between object type, appearance and arrangement, driving a mask generation
module that synthesises coherent scene layouts, whilst also delivering
state-of-the-art sketch-based visual search of scenes.

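Scene Designer is a new method for searching and generating images from free-hand sketches of scene compositions. Its core is a single unified model that learns both a cross-modal search embedding and an object embedding for layout synthesis. We show that a graph network followed by a Transformer, trained with contrastive learning, is required to learn the correlations between object type, appearance and arrangement, driving a mask generation module that synthesises coherent scene layouts while also delivering state-of-the-art sketch-based visual search of scenes.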