With the increased accessibility of the web and online encyclopedias, the amount
of data to manage is constantly increasing. In Wikipedia, for example, there
are millions of pages written in multiple languages. These pages contain images
that often lack textual context, leaving them conceptually unanchored and
therefore harder to find and manage. In this work, we present the system we
designed for participating in the Wikipedia Image-Caption Matching challenge on
Kaggle, whose objective is to use data associated with images (URLs and visual
data) to find the correct caption among a large pool of available ones. A
system able to perform this task would improve the accessibility and
completeness of multimedia content on large online encyclopedias. Specifically,
we propose a cascade of two models, both powered by the recent Transformer
model, able to efficiently and effectively infer a relevance score between the
query image data and the captions. We verify through extensive experimentation
that the proposed two-model approach is an effective way to handle a large pool
of images and captions while keeping the overall computational complexity
bounded at inference time. Our approach obtains a normalized Discounted
Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle
challenge.
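
As a rough illustration of why a two-model cascade keeps inference cost bounded, the sketch below scores the whole caption pool with a cheap dot product and passes only a fixed-size shortlist to a more expensive scorer. It is not the system described here: the names `cascade_rank`, `rerank_fn`, `shortlist_size`, and `ndcg_single` are hypothetical, the embeddings are random stand-ins for model outputs, and the nDCG helper assumes exactly one correct caption per image.

```python
import numpy as np


def cascade_rank(query_vec, caption_vecs, rerank_fn, shortlist_size=100):
    """Rank captions with a cheap first stage and a costly second stage."""
    # Stage 1: cheap dot-product scores against every caption in the pool.
    coarse = caption_vecs @ query_vec
    k = min(shortlist_size, len(coarse))
    # Indices of the k highest coarse scores (order within the k is arbitrary).
    shortlist = np.argpartition(-coarse, k - 1)[:k]
    # Stage 2: the expensive scorer is called only k times, whatever the pool size.
    fine = np.array([rerank_fn(i) for i in shortlist])
    return shortlist[np.argsort(-fine)]


def ndcg_single(ranking, true_idx):
    """nDCG when exactly one caption is relevant (assumption): 1 / log2(1 + rank)."""
    hits = np.flatnonzero(ranking == true_idx)
    if hits.size == 0:
        return 0.0  # the correct caption was not shortlisted
    return float(1.0 / np.log2(hits[0] + 2))  # hits[0] is the 0-based position


# Toy usage with random unit-norm vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
captions = rng.normal(size=(1000, 16))
captions /= np.linalg.norm(captions, axis=1, keepdims=True)
query = captions[42].copy()  # pretend caption 42 is the correct one


def slow_scorer(i):
    # Hypothetical stand-in for a heavier model that scores one caption.
    return -np.linalg.norm(captions[i] - query)


ranking = cascade_rank(query, captions, slow_scorer, shortlist_size=20)
score = ndcg_single(ranking, 42)
```

With this structure the second stage costs the same for a pool of a thousand captions as for a pool of millions; only the inexpensive first stage grows with the pool size.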

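This paper presents the system we designed for the Wikipedia Image-Caption Matching challenge on Kaggle, which uses data associated with images (URLs and visual data) to find the correct caption within a large pool of captions. We propose a cascade of two Transformer-based models that efficiently infers the relevance between the query image data and the captions, and we verify through extensive experiments that it handles large numbers of images and captions while keeping the computational complexity at inference time bounded. On the Kaggle private leaderboard, our method reaches a normalized Discounted Cumulative Gain (nDCG) of 0.53.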