As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, a significant performance gap remains compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts into the self-attention layers and visual prompts into the input layers to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure that systematically prunes low-importance prompts while preserving model performance, which substantially improves the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks while using very few tunable parameters (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
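
The components named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation (see the repository linked above for that). It makes several simplifying assumptions: a single attention head, identity Q/K/V projections, and random values standing in for learned parameters. The function names and the `importance` scores passed to `prune_prompts` are hypothetical, since the abstract does not say how prompt importance is computed.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_kv_prompts(q, k, v, key_prompts, value_prompts):
    """Single-head attention with prompt rows appended to K and V only.

    Queries are untouched, so the output keeps one row per input token.
    """
    k_ext = k + key_prompts      # (n_tokens + n_prompts) rows
    v_ext = v + value_prompts
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale
                  for k_row in k_ext]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v_ext))
                    for j in range(len(v_ext[0]))])
    return out


def prune_prompts(prompts, importance, keep_ratio):
    """Keep the top-scoring prompts, preserving their original order.

    `importance` is a hypothetical per-prompt score supplied by the caller.
    """
    n_keep = max(1, int(len(prompts) * keep_ratio))
    ranked = sorted(range(len(prompts)),
                    key=lambda i: importance[i], reverse=True)
    return [prompts[i] for i in sorted(ranked[:n_keep])]


def rand_matrix(rows, cols):
    """Random stand-in for learned or embedded vectors."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
dim, n_patches = 8, 4
patches = rand_matrix(n_patches, dim)    # stand-in for patch embeddings
visual_prompts = rand_matrix(2, dim)     # input-level prompts
key_prompts = rand_matrix(3, dim)        # attention-level key prompts
value_prompts = rand_matrix(3, dim)      # attention-level value prompts

# Visual prompts are prepended to the input token sequence.
tokens = visual_prompts + patches
# Assumption: Q, K, V projections are omitted (treated as identity).
out = attention_with_kv_prompts(tokens, tokens, tokens,
                                key_prompts, value_prompts)
print(len(out), len(out[0]))

# Drop the least important key prompt using made-up importance scores.
kept = prune_prompts(key_prompts, importance=[0.1, 0.9, 0.5], keep_ratio=0.67)
print(len(kept))
```

Because the key-value prompts extend only the keys and values, the attention output has the same number of rows as the input sequence, so adding or pruning these prompts does not change the shape seen by later layers.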

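Summary: The paper proposes an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for adapting large-scale transformer-based models. It introduces a set of learnable key-value prompts into the self-attention layers and visual prompts into the input layers to make fine-tuning more effective, and it designs a prompt pruning procedure that systematically removes low-importance prompts while preserving model performance, which greatly improves the model's efficiency. Empirical results show that the approach outperforms several state-of-the-art baselines on two benchmarks with very low parameter usage (e.g., 0.32% of model parameters on VTAB-1k).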

E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning