We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is available at https://github.com/GewelsJI/MVLT.
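
To make the idea of masked image reconstruction concrete, the following is a minimal NumPy sketch rather than the MVLT implementation. The patch size, mask ratio, zero-filling of masked patches, and mean-squared-error objective are all assumptions chosen for illustration, and the helper names (`patchify`, `random_patch_mask`, `masked_reconstruction_loss`) are hypothetical. It shows only the generic recipe: split an image into patches, hide a random subset, and score a prediction on the hidden patches alone.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with round(mask_ratio * num_patches) entries set to True."""
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked patches only (assumed loss)."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # stand-in for a fashion image
patches = patchify(image, patch=16)        # 16 patches of 16*16*3 values
mask = random_patch_mask(len(patches), mask_ratio=0.25, rng=rng)

corrupted = patches.copy()
corrupted[mask] = 0.0                      # hide the masked patches (assumed scheme)

# A real model would predict the hidden patches from the visible ones and the
# paired text; here the corrupted input itself stands in for that prediction.
loss = masked_reconstruction_loss(corrupted, patches, mask)
```

Restricting the loss to masked patches means the score reflects how well hidden content is inferred from context, not how well visible pixels are copied.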

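We design a multi-modal representation model for the fashion domain that replaces the BERT in the pre-training model with a vision transformer architecture, yielding an end-to-end framework, and that uses masked image reconstruction for a fine-grained understanding of fashion. The model uses no extra pre-processing model (e.g., ResNet), generalizes easily to various matching and generative tasks, and shows clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks.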

Masked Vision-Language Transformer in Fashion