Vision Transformers (ViTs) have become one of the dominant architectures in computer vision, and pre-trained ViT models are commonly adapted to new tasks via fine-tuning. Recent work has proposed several parameter-efficient transfer learning methods, such as adapters, to avoid the prohibitive training and storage cost of full fine-tuning. In this work, we observe that adapters perform poorly when their hidden dimension is small, and we propose MiMi, a training framework that addresses this issue. We start with large adapters, which can reach high performance, and iteratively reduce their size. To enable automatic estimation of the hidden dimension of every adapter, we also introduce a new scoring function, specifically designed for adapters, that compares neuron importance across layers. Our method outperforms existing methods in finding the best trade-off between accuracy and number of trained parameters across three benchmarks, DomainNet, VTAB, and Multi-task, for a total of 29 datasets.
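The start-large-then-shrink idea can be illustrated with a minimal NumPy sketch. It is not the MiMi implementation: the abstract does not specify the scoring function, so the score used here (mean absolute weight attached to each hidden neuron, normalized by the number of weights so that adapters of different widths are comparable) is an assumption, as are the names `neuron_scores`, `prune_step`, and `shrink`, the `(w_down, w_up)` adapter layout, and the fixed `keep_ratio` schedule.

```python
import numpy as np

def neuron_scores(w_down, w_up):
    """Assumed importance score for each hidden neuron of one adapter.

    w_down has shape (d_in, h) and w_up has shape (h, d_out). The score is
    the mean absolute value of all weights entering and leaving a neuron.
    """
    d_in = w_down.shape[0]
    d_out = w_up.shape[1]
    total = np.abs(w_down).sum(axis=0) + np.abs(w_up).sum(axis=1)
    return total / (d_in + d_out)

def prune_step(adapters, keep_ratio):
    """Keep the top `keep_ratio` fraction of hidden neurons over ALL adapters.

    Scores are pooled across layers, so each adapter ends up with its own
    hidden dimension (possibly zero) instead of a hand-chosen one.
    """
    scores = [neuron_scores(wd, wu) for wd, wu in adapters]
    pooled = np.concatenate(scores)
    n_keep = max(1, int(round(keep_ratio * pooled.size)))
    threshold = np.sort(pooled)[::-1][n_keep - 1]  # n_keep-th largest score
    pruned = []
    for (wd, wu), s in zip(adapters, scores):
        mask = s >= threshold
        pruned.append((wd[:, mask], wu[mask, :]))
    return pruned

def shrink(adapters, keep_ratio, n_rounds):
    """Iteratively reduce adapter sizes."""
    for _ in range(n_rounds):
        # In a real training loop the remaining adapter weights would be
        # fine-tuned here before the next pruning round (omitted in this sketch).
        adapters = prune_step(adapters, keep_ratio)
    return adapters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical adapters: token dimension 8, initial hidden size 16.
    adapters = [(rng.normal(size=(8, 16)), rng.normal(size=(16, 8)))
                for _ in range(3)]
    small = shrink(adapters, keep_ratio=0.5, n_rounds=2)
    print([wd.shape[1] for wd, _ in small])  # per-adapter hidden sizes
```

Pooling the scores before thresholding is what lets the hidden dimension differ from one adapter to the next; a per-adapter threshold would instead shrink every layer by the same ratio.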

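We propose MiMi, a training framework that progressively reduces the size of adapters, maintaining high performance while lowering computation and storage costs. The hidden dimension of each adapter is estimated automatically by comparing neuron importance across adapter layers. On the three benchmarks DomainNet, VTAB, and Multi-task, our method outperforms existing methods in finding the best trade-off between accuracy and number of trained parameters.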

Mini but Mighty: Finetuning ViTs with Mini Adapters