We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT,
which introduces a novel kaleido strategy for fashion cross-modality
representations from transformers. In contrast to the random masking strategy
of recent VL models, we design alignment-guided masking to focus jointly on
image-text semantic relations. To this end, we carry out five novel tasks,
i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for
self-supervised VL pre-training on patches of different scales. Kaleido-BERT is
conceptually simple and easy to extend to the existing BERT framework. It
attains new state-of-the-art results by large margins on four downstream tasks,
including text retrieval (R@1: 4.03% absolute improvement), image retrieval
(R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute
improvement), and fashion captioning (BLEU-4: 1.2 absolute improvement). We
validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites,
demonstrating its broader potential in real-world applications.

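Kaleido-BERT is a new vision-language pre-training model that uses an
alignment-guided masking strategy and five self-supervised tasks for VL
pre-training. It achieves better representations of image-text semantic
relations and leading performance on four downstream tasks, particularly
fashion image captioning, demonstrating its broad potential in real-world
applications.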
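
To make the idea of self-supervised tasks on patches of different scales more
concrete, the following is a minimal sketch, not the authors' implementation.
It splits an image into grids of several sizes and builds one training sample
for a rotation-style task, where the label is the applied rotation. The grid
sizes `(1, 2, 3)`, the function names, and the use of nested lists as images
are all assumptions made for illustration; the abstract does not specify them.

```python
import random


def split_into_grid(image, n):
    """Split a 2D image (list of rows) into an n x n grid of patches, row-major."""
    height, width = len(image), len(image[0])
    if height % n or width % n:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = height // n, width // n
    patches = []
    for gy in range(n):
        for gx in range(n):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches


def multi_scale_patches(image, scales=(1, 2, 3)):
    """Map each grid size to its patches (the scales are an assumed choice)."""
    return {n: split_into_grid(image, n) for n in scales}


def rotate90(patch, k):
    """Return a copy of the patch rotated clockwise by k * 90 degrees."""
    rotated = [list(row) for row in patch]
    for _ in range(k % 4):
        # Reversing the row order and transposing is a 90-degree clockwise turn.
        rotated = [list(row) for row in zip(*rotated[::-1])]
    return rotated


def make_rotation_sample(patch, rng):
    """Build one rotation-task sample: (rotated patch, rotation label k)."""
    k = rng.randrange(4)
    return rotate90(patch, k), k


# Toy 6 x 6 "image" whose pixel values are just running indices.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = multi_scale_patches(image)
rotated_patch, label = make_rotation_sample(patches[2][0], random.Random(0))
```

In this sketch a model would receive `rotated_patch` and be trained to predict
`label`. The other four tasks named in the abstract could be set up in a
similar way, each with its own corruption of the patch and its own prediction
target.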