Parameter-efficient fine-tuning (PEFT) has emerged as a popular approach for
adapting pre-trained Vision Transformer (ViT) models to downstream
applications. While current PEFT methods achieve parameter efficiency, they
overlook GPU memory and time efficiency during both fine-tuning and inference,
due to the repeated computation of redundant tokens in the ViT architecture.
This falls short of practical requirements for downstream task adaptation. In
this paper, we propose \textbf{Sparse-Tuning}, a novel tuning paradigm that
substantially enhances both fine-tuning and inference efficiency for
pre-trained ViT models. Sparse-Tuning efficiently fine-tunes the pre-trained
ViT by sparsely preserving the informative tokens and merging redundant ones,
enabling the ViT to focus on the foreground while reducing computational costs
on background regions in the images. To accurately distinguish informative
tokens from uninformative ones, we introduce a tailored Dense Adapter, which
establishes dense connections across different encoder layers in the ViT,
thereby enhancing the representational capacity and quality of token
sparsification. Empirical results on VTAB-1K, three complete image datasets,
and two complete video datasets demonstrate that Sparse-Tuning reduces the
GFLOPs to \textbf{62\%-70\%} of the original ViT-B while achieving
state-of-the-art performance. Source code is available at
https://github.com/liuting20/Sparse-Tuning.
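
The keep-and-merge idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function name `sparsify_tokens`, the `keep_ratio` parameter, the use of an externally supplied per-token importance score, and the score-weighted averaging used to merge the discarded tokens are all assumptions made for illustration. The abstract does not specify how informativeness is scored or how redundant tokens are merged.

```python
from typing import List, Sequence


def sparsify_tokens(tokens: Sequence[Sequence[float]],
                    scores: Sequence[float],
                    keep_ratio: float = 0.5) -> List[List[float]]:
    """Keep the highest-scoring tokens and merge the rest into one token.

    `scores` is a hypothetical per-token importance score (assumption).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank token indices by score, highest first.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])  # restore the original token order
    drop_idx = order[n_keep:]

    kept = [list(tokens[i]) for i in keep_idx]
    if not drop_idx:
        return kept

    # Assumed merge rule: score-weighted average of the discarded tokens.
    total = sum(scores[i] for i in drop_idx)
    if total > 0:
        weights = [scores[i] / total for i in drop_idx]
    else:
        weights = [1.0 / len(drop_idx)] * len(drop_idx)
    dim = len(tokens[0])
    merged = [sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
              for d in range(dim)]
    return kept + [merged]
```

With four tokens and `keep_ratio=0.5`, the two highest-scoring tokens are kept unchanged and the other two are collapsed into a single merged token, so later layers would process three tokens instead of four.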

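Sparse-Tuning is a new tuning paradigm that sparsely preserves informative tokens and merges redundant ones, sharpening the focus on the foreground and reducing computation on background regions. It enables efficient fine-tuning and inference for pre-trained ViT models while meeting the GPU memory and time efficiency requirements that existing methods fail to satisfy.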

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

This paper addresses the problem of cross-modal object tracking from RGB
videos and event data. Rather than constructing a complex cross-modal fusion
network, we explore the great potential of a pre-trained vision Transformer
(ViT). In particular, we carefully investigate plug-and-play training
augmentations that encourage the ViT to bridge the vast distribution gap
between the two modalities, enabling comprehensive cross-modal information
interaction and thus enhancing its tracking ability. Specifically, we propose a
mask modeling strategy that randomly masks a specific modality of some tokens
to enforce proactive interaction between tokens from different modalities. To
mitigate network oscillations resulting from the masking
strategy and further amplify its positive effect, we then propose a
theoretically motivated orthogonal high-rank loss to regularize the attention
matrix. Extensive experiments demonstrate that our plug-and-play training
augmentation techniques significantly boost state-of-the-art one-stream and
two-stream trackers in terms of both tracking precision and success rate. Our new
perspective and findings will potentially bring insights to the field of
leveraging powerful pre-trained ViTs to model cross-modal data. The code will
be publicly available.
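
The mask modeling strategy described above (randomly masking one modality for some tokens) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `mask_modalities`, the `mask_ratio` parameter, the assumption that RGB and event tokens are aligned position by position, the equal chance of masking either modality, and the use of zero vectors as the mask value are all hypothetical choices.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def mask_modalities(rgb: Sequence[Sequence[float]],
                    event: Sequence[Sequence[float]],
                    mask_ratio: float = 0.3,
                    rng: Optional[random.Random] = None
                    ) -> Tuple[List[Vector], List[Vector]]:
    """For a random subset of token positions, zero out one modality.

    Assumes rgb[i] and event[i] describe the same position (assumption).
    """
    if len(rgb) != len(event):
        raise ValueError("rgb and event must have the same number of tokens")
    if rng is None:
        rng = random.Random()

    n = len(rgb)
    n_mask = int(n * mask_ratio)
    positions = rng.sample(range(n), n_mask)

    # Copy the inputs so the caller's token lists are left untouched.
    rgb_out = [list(t) for t in rgb]
    event_out = [list(t) for t in event]

    for p in positions:
        # Assumed rule: mask RGB or event with equal probability, never both.
        if rng.random() < 0.5:
            rgb_out[p] = [0.0] * len(rgb_out[p])
        else:
            event_out[p] = [0.0] * len(event_out[p])
    return rgb_out, event_out
```

Because exactly one modality is removed at each selected position, a model trained on the masked tokens has to recover the missing information from the other modality, which is the cross-modal interaction the strategy is meant to enforce.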

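This work addresses cross-modal object tracking from RGB videos and event data. Rather than constructing a complex cross-modal fusion network, we exploit the great potential of a pre-trained vision Transformer (ViT). In particular, we investigate plug-and-play training augmentations that encourage the ViT to bridge the large distribution gap between the two modalities and strengthen their interaction, thereby improving its tracking ability. Specifically, we propose a mask modeling strategy that randomly masks a specific modality of some tokens to enforce cross-modal interaction, together with an orthogonal high-rank loss that regularizes the attention matrix. Extensive experiments show that these plug-and-play training augmentations significantly improve state-of-the-art one-stream and two-stream trackers in both tracking precision and success rate, and may bring new insights to cross-modal data modeling. The code will be made publicly available.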