May 2024
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li
TL;DR: Contrastive pre-training methods such as CLIP are computationally demanding. Unmasked Token Alignment (UTA) instead leverages a pre-trained CLIP model to enhance vision-language representations: it trains a Vision Transformer (ViT) by aligning its unmasked tokens to the CLIP features, requires no training on image-text pairs, and outperforms existing methods.
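To make the idea concrete, below is a minimal NumPy sketch of the kind of objective the name suggests: only the student ViT's *unmasked* token features are aligned to a frozen CLIP teacher's token features via a cosine-similarity loss. The function name, shapes, and loss form are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def uta_alignment_loss(student_tokens, teacher_tokens, unmasked):
    """Cosine-alignment loss between a student ViT and a frozen CLIP teacher.

    student_tokens: (num_tokens, dim) features from the ViT being trained.
    teacher_tokens: (num_tokens, dim) features from the frozen CLIP model.
    unmasked:       (num_tokens,) boolean mask; only True positions are aligned.
    Returns the mean of (1 - cosine similarity) over unmasked tokens.
    """
    s = student_tokens[unmasked]
    t = teacher_tokens[unmasked]
    # L2-normalize so the dot product equals cosine similarity.
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy example: random "teacher" and "student" token features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = rng.normal(size=(4, 8))
mask = np.array([True, False, True, True])  # token 1 is masked out

loss = uta_alignment_loss(student, teacher, mask)
```

The key point is that no text encoder or image-text pairing appears anywhere: supervision comes entirely from the frozen CLIP vision features, which is what makes the pre-training comparatively cheap.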