Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu
TL;DR: By tuning the prompts of both the vision and language modalities in a sequential manner, Token-wise Adaptive for Multi-modal Prompt Learning (APLe) addresses the challenges in vision-language models, improves prompt-learning performance, and achieves generalization performance competitive with the state of the art.
Abstract
Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research, including the challenge of sensitivity to text input and the tuning process across multi-modal prompts.
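To make the TL;DR concrete, below is a minimal, hypothetical sketch of sequential, token-wise prompt tuning over two modalities. It is not APLe's actual implementation: the stub encoders, the per-token gates, the hyper-parameters, and the alternating "text then image" schedule are all assumptions chosen to keep the example self-contained and runnable; in the paper the backbone would be a frozen CLIP model.

```python
# Hypothetical sketch: sequential, token-wise multi-modal prompt tuning.
# Encoder stubs, gates, and hyper-parameters are illustrative, not APLe's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, N_PROMPT_TOKENS = 512, 4

class PromptedEncoder(nn.Module):
    """Stand-in for a frozen encoder that accepts prepended prompt tokens."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=8, batch_first=True)
        for p in self.backbone.parameters():
            p.requires_grad = False                      # backbone stays frozen

    def forward(self, tokens, prompts):
        x = torch.cat([prompts.expand(tokens.size(0), -1, -1), tokens], dim=1)
        return self.backbone(x).mean(dim=1)              # pooled feature

text_encoder, image_encoder = PromptedEncoder(), PromptedEncoder()

# One learnable prompt per modality; a per-token gate supplies the
# "token-wise" adaptation (an assumption for illustration).
text_prompt  = nn.Parameter(torch.randn(1, N_PROMPT_TOKENS, EMBED_DIM) * 0.02)
image_prompt = nn.Parameter(torch.randn(1, N_PROMPT_TOKENS, EMBED_DIM) * 0.02)
text_gate    = nn.Parameter(torch.ones(1, N_PROMPT_TOKENS, 1))
image_gate   = nn.Parameter(torch.ones(1, N_PROMPT_TOKENS, 1))

def contrastive_loss(img_feat, txt_feat, labels):
    # CLIP-style image-to-text matching with a fixed temperature.
    logits = F.normalize(img_feat, dim=-1) @ F.normalize(txt_feat, dim=-1).T / 0.07
    return F.cross_entropy(logits, labels)

def train_step(img_tokens, txt_tokens, labels, modality):
    # Sequential tuning: only one modality's prompt is updated per phase.
    params = [text_prompt, text_gate] if modality == "text" else [image_prompt, image_gate]
    opt = torch.optim.SGD(params, lr=1e-3)
    txt_feat = text_encoder(txt_tokens, text_prompt * text_gate)
    img_feat = image_encoder(img_tokens, image_prompt * image_gate)
    loss = contrastive_loss(img_feat, txt_feat, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy batch: random image/text token embeddings with matched labels.
imgs, txts = torch.randn(8, 16, EMBED_DIM), torch.randn(8, 8, EMBED_DIM)
labels = torch.arange(8)
for modality in ["text", "image"]:                       # one modality per phase
    train_step(imgs, txts, labels, modality)
```

The only design point the sketch tries to convey is the scheduling: rather than optimizing both prompts jointly, each phase freezes one modality's prompt and adapts the other token by token, which is how the TL;DR characterizes APLe's sequential approach.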