It has recently been discovered that using a pre-trained vision-language
model (VLM), e.g., CLIP, to align a whole query image with several finer text
descriptions generated by a large language model can significantly enhance
zero-shot performance. In this paper, however, we find empirically that the
finer descriptions tend to align better with local areas of the query
image than with the whole image, and we validate this finding theoretically.
We therefore present a method called weighted visual-text cross alignment
(WCA). This method begins with a localized visual prompting technique, designed
to identify local visual areas within the query image. The local visual areas
are then cross-aligned with the finer descriptions by creating a similarity
matrix using the pre-trained VLM. To determine how well a query image aligns
with each category, we develop a score function based on the weighted
similarities in this matrix. Extensive experiments demonstrate that our method
significantly improves zero-shot performance across various datasets, achieving
results that are even comparable to few-shot learning methods.
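
The scoring step described above can be illustrated with a minimal sketch. It assumes the embeddings of the local image areas and of the text descriptions have already been produced by a pre-trained VLM such as CLIP. The function names (`l2_normalize`, `wca_score`, `classify`), the use of cosine similarity, and the way the weights are supplied are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def wca_score(area_emb, text_emb, area_w, text_w):
    """Weighted sum over the area-by-description similarity matrix.

    area_emb: (n_areas, d) embeddings of local image areas (assumed given).
    text_emb: (n_desc, d) embeddings of one category's descriptions.
    area_w:   (n_areas,) weights for the local areas (assumed given).
    text_w:   (n_desc,) weights for the descriptions (assumed given).
    """
    # Similarity matrix: entry (i, j) compares local area i with description j.
    sim = l2_normalize(area_emb) @ l2_normalize(text_emb).T
    return float(area_w @ sim @ text_w)


def classify(area_emb, class_text_embs, area_w, class_text_ws):
    """Score every category and return the best label plus all scores."""
    scores = {
        label: wca_score(area_emb, class_text_embs[label], area_w, class_text_ws[label])
        for label in class_text_embs
    }
    return max(scores, key=scores.get), scores


# Toy usage with hand-made 2-D "embeddings" (hypothetical values).
areas = np.array([[1.0, 0.0], [0.0, 1.0]])
area_weights = np.array([0.9, 0.1])
texts = {"cat": np.array([[1.0, 0.0]]), "dog": np.array([[0.0, 1.0]])}
text_weights = {"cat": np.array([1.0]), "dog": np.array([1.0])}
label, scores = classify(areas, texts, area_weights, text_weights)
```

In this toy case the first local area carries most of the weight and matches the "cat" description, so "cat" receives the higher score. How the weights are actually derived is not specified in the abstract and is left as an input here.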

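Aligning a query image with fine-grained text descriptions using a pre-trained vision-language model can significantly enhance zero-shot performance. We therefore propose weighted visual-text cross alignment (WCA), which uses a localized visual prompting technique to identify local visual areas within the query image, aligns these areas with the fine-grained descriptions by building a similarity matrix with the pre-trained vision-language model, and then uses a score function based on the weighted similarities in this matrix to determine how well the query image aligns with each category. Experiments show that our method significantly improves zero-shot performance, with results even comparable to few-shot learning methods.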