AbstractThe success of
vision transformer (ViT) has been widely reported on a wide range of image recognition tasks. The merit of ViT over CNN has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on
→