Large pre-trained Vision-Language Models (VLMs), like CLIP, exhibit strong
generalization ability to downstream tasks but struggle in few-shot scenarios.
Existing prompting techniques primarily focus on global text and image
representations, yet overlook multi-modal attribute characteristics. This
limitation hinders the model's ability to perceive fine-grained visual details
and restricts its generalization ability to a broader range of unseen classes.
To address this issue, we propose a Multi-modal Attribute Prompting method
(MAP) by jointly exploring textual attribute prompting, visual attribute
prompting, and attribute-level alignment. The proposed MAP enjoys several
merits. First, we introduce learnable visual attribute prompts enhanced by
textual attribute semantics to adaptively capture visual attributes for images
from unknown categories, boosting fine-grained visual perception capabilities
for CLIP. Second, the proposed attribute-level alignment complements the global
alignment to enhance the robustness of cross-modal alignment for
open-vocabulary objects. To our knowledge, this is the first work to establish
cross-modal attribute-level alignment for CLIP-based few-shot adaptation.
Extensive experimental results on 11 datasets demonstrate that our method
performs favorably against state-of-the-art approaches.
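
To illustrate how an attribute-level score might complement a global
image-text score at classification time, here is a minimal sketch. The
abstract does not say how either score is computed, so everything below is an
assumption made for illustration: plain Python lists stand in for CLIP
embeddings, the attribute score is a simple best-match average of cosine
similarities, and the mixing weight `lam` and all function names are
hypothetical. It is not the MAP implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute_score(visual_attrs, text_attrs):
    """Assumed attribute-level score: for each textual attribute, take the
    best-matching visual attribute, then average over textual attributes."""
    best = [max(cosine(t, v) for v in visual_attrs) for t in text_attrs]
    return sum(best) / len(best)


def combined_score(global_img, global_txt, visual_attrs, text_attrs, lam=0.5):
    """Global similarity plus a weighted attribute-level term.
    `lam` is a hypothetical mixing weight, not a value from the paper."""
    return cosine(global_img, global_txt) + lam * attribute_score(
        visual_attrs, text_attrs
    )


def classify(global_img, visual_attrs, classes, lam=0.5):
    """Pick the class with the highest combined score.
    `classes` maps a class name to (global_text_vector, text_attribute_vectors)."""
    return max(
        classes,
        key=lambda name: combined_score(
            global_img, classes[name][0], visual_attrs, classes[name][1], lam
        ),
    )
```

In this toy setup, two classes whose global text vectors are equally similar
to the image can still be separated by the attribute-level term, which is the
intuition behind adding it to the global alignment.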

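We propose a Multi-modal Attribute Prompting method (MAP) that jointly
explores textual attribute prompting, visual attribute prompting, and
attribute-level alignment to address some limitations of large pre-trained
Vision-Language Models (VLMs) in few-shot settings. Experimental results show
that our method outperforms existing methods on 11 datasets.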

Multi-modal Attribute Prompting for Vision-Language Models

We study the task of generating profitable Non-Fungible Token (NFT) images
from user-input texts. Recent advances in diffusion models have shown great
potential for image generation. However, existing works can fall short in
generating visually pleasing and highly profitable NFT images, mainly due to
the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT
image, and 2) effective optimization metrics for generating high-quality NFT
images. To solve these challenges, we propose a Diffusion-based generation
framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for
NFT images. The proposed framework consists of a large language model (LLM), a
diffusion-based image generator, and a set of purpose-designed visual rewards.
First, the LLM enhances a basic human input (such as "panda") by generating
more comprehensive NFT-style prompts that include specific visual attributes,
such as "panda with Ninja style and green background." Second, the
diffusion-based image generator is fine-tuned using a large-scale NFT dataset
to capture fine-grained image styles and accessory compositions of popular NFT
elements. Third, we use multiple visual policies as
optimization goals, including visual rarity levels, visual aesthetic scores,
and CLIP-based text-image relevance. This design ensures that our proposed
Diffusion-MVP is capable of minting NFT images with high visual quality and
market value. To facilitate this research, we have collected the largest
publicly available NFT image dataset to date, consisting of 1.5 million
high-quality images with corresponding texts and market values. Extensive
experiments including objective evaluations and user studies demonstrate that
our framework can generate NFT images showing more visually engaging elements
and higher market value, compared with SOTA approaches.
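
To illustrate how several visual rewards could be folded into one optimization
signal, here is a minimal sketch. The abstract names three reward types
(visual rarity, visual aesthetics, CLIP-based text-image relevance) but does
not say how they are combined, so the weighted sum, the assumption that each
score is already normalized to [0, 1], and all names (`RewardWeights`,
`combined_reward`, `best_candidate`) are hypothetical. It is not the
Diffusion-MVP implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical per-reward weights; the paper's values are not given."""
    rarity: float = 1.0
    aesthetic: float = 1.0
    relevance: float = 1.0


def combined_reward(rarity, aesthetic, relevance, weights=RewardWeights()):
    """Assumed combination rule: a weighted sum of the three reward scores,
    each taken to be pre-normalized to the range [0, 1]."""
    return (
        weights.rarity * rarity
        + weights.aesthetic * aesthetic
        + weights.relevance * relevance
    )


def best_candidate(candidates, weights=RewardWeights()):
    """Return the name of the candidate image with the highest combined reward.
    `candidates` maps a name to a (rarity, aesthetic, relevance) tuple."""
    return max(
        candidates,
        key=lambda name: combined_reward(*candidates[name], weights=weights),
    )
```

Changing the weights shifts which generated image is preferred: for example, a
larger rarity weight favors rarer images even when their aesthetic score is
lower.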

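This paper proposes Diffusion-MVP, a diffusion-based NFT image generation
framework that uses multiple visual policies as rewards, including diverse
visual rarity levels, visual aesthetic scores, and CLIP-based text-image
relevance. Experimental results show that the NFT images generated by our
framework have higher visual quality and market value than those of existing
state-of-the-art methods.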