Sep, 2024
Multi-Modal Adapter for Vision-Language Models
Dominykas Seputis, Serghei Mihailov, Soham Chatterjee, Zehao Xiao
TL;DR
This work addresses a limitation of existing lightweight adaptation methods: the lack of interaction between visual and textual representations. It proposes a new approach called Multi-Modal Adapter, which introduces a trainable multi-head attention layer to combine image and text features. The method achieves better generalization and outperforms existing adaptation methods on unseen classes.
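To make the idea concrete, below is a minimal PyTorch sketch of the kind of multi-head attention adapter the TL;DR describes: frozen CLIP text (class) embeddings attend to the frozen image embedding before the usual cosine-similarity classification. The specific fusion design, dimensions, residual connection, and temperature are assumptions for illustration, not the authors' exact architecture.

```python
# A sketch of a multi-modal adapter for CLIP-style models (assumed design,
# not the paper's exact architecture): a trainable multi-head attention layer
# lets text (class) embeddings attend to the image embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalAdapter(nn.Module):
    def __init__(self, embed_dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Trainable attention layer that mixes image and text features.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (num_classes, D) frozen CLIP text embeddings
        # image_feats: (batch, D)       frozen CLIP image embeddings
        batch = image_feats.size(0)
        queries = text_feats.unsqueeze(0).expand(batch, -1, -1)   # (B, C, D)
        context = image_feats.unsqueeze(1)                        # (B, 1, D)
        # Text queries attend to the image; the residual keeps zero-shot knowledge.
        fused, _ = self.attn(queries, context, context)
        return self.norm(queries + fused)                         # (B, C, D)


def classify(adapter, image_feats, text_feats, temperature=0.01):
    # Cosine similarity between each image and its adapted class embeddings.
    adapted = adapter(text_feats, image_feats)                    # (B, C, D)
    img = F.normalize(image_feats, dim=-1).unsqueeze(1)           # (B, 1, D)
    txt = F.normalize(adapted, dim=-1)                            # (B, C, D)
    return (img * txt).sum(-1) / temperature                      # logits (B, C)
```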
Abstract
Large pre-trained Vision-Language Models, such as CLIP, have demonstrated state-of-the-art performance across a wide range of image classification tasks, without requiring retraining. Few-shot …