AbstractRecently,
vision transformer (ViT) and its variants have achieved promising performances in various computer vision tasks. Yet the high computational costs and training data requirements of ViTs limit their application in resource-constrained settings.
→