The computational complexity of vision transformers (ViTs) is quadratic in the number of tokens, which limits their practical applications. Several works propose pruning redundant tokens to make ViTs more efficient. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) appl