Apr, 2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai...
TL;DR
Proposes a method called MAD that uses pretrained unimodal vision and text encoders to adaptively distill knowledge into a cross-modal VL encoder, improving cross-modal learning performance and, in particular, achieving SOTA results on VCR.
Abstract
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach an order of 10 million samples, the labor cost is prohibitive t…
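
As a rough illustration of the adaptive-distillation idea described in the TL;DR, below is a minimal sketch assuming generic PyTorch modules and a simple feature-matching objective; the projection heads, MSE loss, and all names here are assumptions for illustration, not the paper's actual MAD formulation.

```python
# Hypothetical sketch: distilling frozen unimodal teachers (vision, text)
# into a cross-modal student via learned projections and a feature-matching loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillationHead(nn.Module):
    """Projects student features into each unimodal teacher's embedding space."""

    def __init__(self, student_dim: int, vision_dim: int, text_dim: int):
        super().__init__()
        self.to_vision = nn.Linear(student_dim, vision_dim)
        self.to_text = nn.Linear(student_dim, text_dim)

    def forward(self, student_feats: torch.Tensor):
        return self.to_vision(student_feats), self.to_text(student_feats)


def distillation_loss(student_feats, vision_teacher_feats, text_teacher_feats, head):
    """Match projected student features to frozen unimodal teacher features."""
    pred_v, pred_t = head(student_feats)
    loss_v = F.mse_loss(pred_v, vision_teacher_feats.detach())
    loss_t = F.mse_loss(pred_t, text_teacher_feats.detach())
    return loss_v + loss_t
```

In practice this auxiliary loss would be added to the cross-modal encoder's task loss during training, with the unimodal teachers kept frozen.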