Recently, textual prompt tuning has shown promising performance in
adapting Contrastive Language-Image Pre-training (CLIP) models to natural image
quality assessment. However, such uni-modal prompt learning methods tune only
the language branch of CLIP models. This is insufficient for adapting CLIP
models to AI generated image quality assessment (AGIQA), since AI generated
images (AGIs) differ visually from natural images. In addition, the consistency
between AGIs and user input text prompts, which correlates with the perceptual
quality of AGIs, has not been exploited to guide AGIQA. In this letter, we
propose vision-language
consistency guided multi-modal prompt learning for blind AGIQA, dubbed
CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in
the language and vision branches of CLIP models, respectively. Moreover, we design
a text-to-image alignment quality prediction task, whose learned
vision-language consistency knowledge is used to guide the optimization of the
above multi-modal prompts. Experimental results on two public AGIQA datasets
demonstrate that the proposed method outperforms state-of-the-art quality
assessment models. The source code is available at
this https URL
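
As a rough illustration of the multi-modal prompting idea described above, the following minimal sketch prepends learnable prompt vectors to the token sequence of each branch and then measures text-image consistency with a cosine similarity. It is not the authors' implementation: the helper names (`make_prompts`, `prepend_prompts`, `mean_pool`, `cosine`), the dimensions, and the use of mean pooling as a stand-in for the frozen CLIP encoders are all assumptions made for illustration, and no training step is shown.

```python
import math
import random

def make_prompts(num_tokens, dim, seed):
    """Create `num_tokens` prompt vectors with a small random init (hypothetical)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]

def prepend_prompts(prompts, tokens):
    """Return a new sequence: learnable prompts followed by the input tokens."""
    return [list(p) for p in prompts] + [list(t) for t in tokens]

def mean_pool(sequence):
    """Stand-in for a frozen encoder (assumption): average the token vectors."""
    dim = len(sequence[0])
    return [sum(tok[i] for tok in sequence) / len(sequence) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

DIM = 8
text_prompts = make_prompts(4, DIM, seed=0)    # would be tuned in the language branch
visual_prompts = make_prompts(2, DIM, seed=1)  # would be tuned in the vision branch

# Random placeholders for the embedded user prompt and the image patches.
rng = random.Random(42)
text_tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]
image_patches = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(9)]

text_seq = prepend_prompts(text_prompts, text_tokens)
image_seq = prepend_prompts(visual_prompts, image_patches)

# Vision-language consistency proxy between the two prompted branches.
consistency = cosine(mean_pool(text_seq), mean_pool(image_seq))
```

In an actual CLIP setup, only the prompt vectors would receive gradients while the encoder weights stay frozen; the sketch only shows where the prompts enter each branch.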

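This letter proposes CLIP-AGIQA, a vision-language consistency guided multi-modal prompt learning method for blind AI generated image quality assessment. Experimental results on two public AGIQA datasets show that it outperforms existing quality assessment models.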

Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment

Textual prompt tuning has demonstrated significant performance improvements
in adapting natural language processing models to a variety of downstream tasks
by treating hand-engineered prompts as trainable parameters. Inspired by the
success of textual prompting, several studies have investigated the efficacy of
visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA),
the first framework that generalizes visual prompting with test-time
adaptation. VPA introduces a small number of learnable tokens, enabling fully
test-time and storage-efficient adaptation without necessitating source-domain
information. We examine our VPA design under diverse adaptation settings,
encompassing single-image, batched-image, and pseudo-label adaptation. We
evaluate VPA on multiple tasks, including out-of-distribution (OOD)
generalization, corruption robustness, and domain adaptation. Experimental
results reveal that VPA effectively enhances OOD generalization by 3.3% across
various models, surpassing previous test-time approaches. Furthermore, we show
that VPA improves corruption robustness by 6.5% compared to strong baselines.
Finally, we demonstrate that VPA also boosts domain adaptation performance by
a relative 5.2%. Our VPA also exhibits marked effectiveness in improving the
robustness of zero-shot recognition for vision-language models.
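
To make the single-image adaptation setting more concrete, here is a minimal sketch of test-time prompt adaptation on a toy model. The abstract does not state the adaptation objective, so the sketch assumes entropy minimization on the unlabeled test input; the frozen two-class linear classifier `W`, the mean-pooling "backbone", the finite-difference gradient, and the names `predict`, `entropy`, and `adapt_prompt` are likewise hypothetical and are not taken from VPA.

```python
import math

# Frozen toy classifier (assumption): 2 classes x 2 features, never updated.
W = [[1.0, -0.5], [-0.3, 0.8]]

def predict(prompt, tokens):
    """Prepend the prompt token, mean-pool, apply the frozen classifier, softmax."""
    seq = [prompt] + tokens
    dim = len(prompt)
    feat = [sum(tok[i] for tok in seq) / len(seq) for i in range(dim)]
    logits = [sum(w * f for w, f in zip(row, feat)) for row in W]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adapt_prompt(prompt, tokens, steps=20, lr=1.0, eps=1e-5):
    """Update only the prompt by gradient descent on prediction entropy.

    The gradient is estimated with central finite differences to keep the
    sketch dependency-free; no labels or source-domain data are used.
    """
    prompt = list(prompt)
    for _ in range(steps):
        grad = []
        for i in range(len(prompt)):
            up = prompt[:i] + [prompt[i] + eps] + prompt[i + 1:]
            down = prompt[:i] + [prompt[i] - eps] + prompt[i + 1:]
            diff = entropy(predict(up, tokens)) - entropy(predict(down, tokens))
            grad.append(diff / (2.0 * eps))
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# One unlabeled test sample, represented as two token vectors.
test_tokens = [[0.5, 0.1], [0.2, -0.4]]
initial_prompt = [0.0, 0.0]

adapted_prompt = adapt_prompt(initial_prompt, test_tokens)
entropy_before = entropy(predict(initial_prompt, test_tokens))
entropy_after = entropy(predict(adapted_prompt, test_tokens))
```

Because only the prompt vector changes, the per-sample state that needs to be stored is a handful of numbers, which is the storage-efficiency argument the abstract makes for learnable tokens.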

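By introducing a small number of learnable tokens, VPA (Visual Prompt Adaptation) is a framework that generalizes visual prompting with test-time adaptation and requires no source-domain information. Experimental results show that VPA effectively improves out-of-distribution generalization, corruption robustness, and domain adaptation across various models, and also improves the robustness of zero-shot recognition for vision-language models.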