TL;DR: Using the Resizable-ViT model together with a Token-Length Assigner, each image is assigned the smallest token length that preserves accuracy, which speeds up ViT inference and significantly reduces computational cost.
Abstract
The vision transformer is a model that breaks each image into a sequence of tokens of fixed length and processes them much like words in natural language processing. Although increasing the number of tokens typically yields better performance, it also leads to a considerable increase in computational cost.
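The idea in the TL;DR can be illustrated with a minimal sketch. All names and the confidence-threshold rule below are assumptions for illustration; the paper's actual Token-Length Assigner is a learned component, not this heuristic.

```python
# Conceptual sketch (hypothetical): pick the smallest supported token
# length whose model confidence already clears a threshold, so easy
# images get cheap, short-token inference.

TOKEN_LENGTHS = [49, 98, 196]  # assumed set of supported token counts

def assign_token_length(confidence_at_length, threshold=0.9):
    """Return the smallest token length whose (simulated) confidence
    exceeds `threshold`; fall back to the longest length otherwise."""
    for n in TOKEN_LENGTHS:
        if confidence_at_length(n) >= threshold:
            return n
    return TOKEN_LENGTHS[-1]

# Toy confidence curves standing in for a real model's output.
easy = lambda n: 0.95               # confident even at 49 tokens
hard = lambda n: 0.5 + n / 500.0    # never clears 0.9 -> longest length

print(assign_token_length(easy))  # → 49
print(assign_token_length(hard))  # → 196
```

Under this rule, per-image cost scales with image difficulty rather than being fixed at the longest token length.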