Compressing large, high-performing vision foundation models (VFMs) to arbitrary bit-wise operation (BitOPs) budgets allows them to be deployed on a variety of hardware. We propose to fine-tune a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) suits this purpose: a supernet is trained once, and subnets that fit arbitrary hardware budgets are then extracted from it. However, existing methods have difficulty optimizing the mixed-precision search space and incur large memory costs during training. To tackle these challenges, we first study effective search space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-width) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated on the recently proposed VFM Segment Anything Model, fine-tuned on segmentation tasks. The searched model yields about a 95% reduction in BitOPs without incurring performance degradation.
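The abstract does not define how BitOPs are counted. A minimal sketch, assuming the common convention that a layer's BitOPs equal its multiply-accumulate count times its weight bit-width times its activation bit-width, shows how a mixed-precision subnet could be scored against a hardware budget. The `LinearLayer` type, the function names, and the layer sizes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearLayer:
    """Hypothetical description of one linear layer (illustration only)."""
    in_features: int
    out_features: int
    tokens: int  # number of input tokens the layer is applied to


def macs(layer: LinearLayer) -> int:
    """Multiply-accumulate operations for one forward pass of the layer."""
    return layer.tokens * layer.in_features * layer.out_features


def bitops(layer: LinearLayer, w_bits: int, a_bits: int) -> int:
    """Assumed convention: BitOPs = MACs * weight bits * activation bits."""
    return macs(layer) * w_bits * a_bits


def total_bitops(layers, precisions) -> int:
    """Sum BitOPs over layers; precisions holds one (w_bits, a_bits) pair per layer."""
    return sum(bitops(l, w, a) for l, (w, a) in zip(layers, precisions))


def bitops_reduction(layers, precisions, base_bits: int = 32) -> float:
    """Fraction of BitOPs saved relative to a uniform base_bits baseline."""
    baseline = total_bitops(layers, [(base_bits, base_bits)] * len(layers))
    return 1.0 - total_bitops(layers, precisions) / baseline


if __name__ == "__main__":
    # Two made-up layers and one candidate mixed-precision assignment.
    layers = [LinearLayer(768, 3072, 196), LinearLayer(3072, 768, 196)]
    candidate = [(8, 8), (4, 8)]
    print(f"BitOPs reduction: {bitops_reduction(layers, candidate):.4f}")
```

Under this convention, lowering bit-widths alone already cuts BitOPs sharply, and the other search dimensions listed in the abstract (resolution, feature size, width, depth) would act through the MAC count instead.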

Mixed-Precision Supernet Training from Vision Foundation Models Using Low-Rank Adapters