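TL;DR: A wavelet-transform-based image tokenizer improves training throughput and reduces top-1 error on the ImageNet validation set, while opening a new research direction for the design of ViT-based models.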
Abstract
Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accur