Wavelet-Based Image Tokenizer for Vision Transformers

May, 2024

Zhenhai Zhu, Radu Soricut

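TL;DR: A wavelet-transform-based image tokenizer improves training throughput and reduces top-1 error on the ImageNet validation set, while opening a new research direction for ViT-based model design.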

Abstract

Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accur