May 2024

Enhancing Vision-Language Models with Unmasked Token Alignment

TL;DR: Contrastive pre-training techniques like CLIP are computationally demanding. Unmasked Token Alignment (UTA) instead leverages a pre-trained CLIP model to enhance vision-language representations: it trains a Vision Transformer (ViT) by aligning its unmasked tokens with the CLIP features, requires no training on image-text pairs, and outperforms existing methods.
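
To make the idea concrete, here is a minimal PyTorch sketch of one alignment step. The `student_vit` and `clip_vision_encoder` modules, their signatures, and the random masking scheme are assumptions for illustration, not the paper's exact recipe; both modules are assumed to return per-patch token features of shape `(batch, num_tokens, dim)`.

```python
import torch
import torch.nn.functional as F

def uta_alignment_loss(student_vit, clip_vision_encoder, images, mask_ratio=0.5):
    """Illustrative unmasked-token alignment step (hypothetical APIs).

    `clip_vision_encoder` is a frozen CLIP vision tower; `student_vit` is the
    ViT being trained. Both are assumed to return (B, N, D) token features.
    """
    with torch.no_grad():
        # Frozen CLIP teacher sees the full image and produces target tokens.
        target_tokens = clip_vision_encoder(images)  # (B, N, D)

    B, N, D = target_tokens.shape
    # Randomly keep a subset of patch tokens: these are the "unmasked" tokens.
    num_keep = int(N * (1 - mask_ratio))
    keep_idx = torch.rand(B, N, device=images.device).argsort(dim=1)[:, :num_keep]

    # The student encodes only the visible patches (hypothetical argument
    # `visible_idx` standing in for whatever masking interface the model has).
    student_tokens = student_vit(images, visible_idx=keep_idx)  # (B, num_keep, D)

    # Gather the matching teacher tokens and align them with a cosine loss.
    target_kept = target_tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    loss = 1 - F.cosine_similarity(student_tokens, target_kept, dim=-1).mean()
    return loss
```

Because the frozen CLIP teacher's token features already carry language-aligned semantics, aligning the student's unmasked tokens to them transfers vision-language alignment without any contrastive training on image-text pairs.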