In language processing, transformers benefit greatly from condensed text. This condensation is achieved with a larger vocabulary that captures word fragments instead of single characters, most often built with Byte Pair Encoding. For images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, without global content awareness. Our work improves tokenisation of visual data by extending Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this by counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. For images, the multidimensionality only increases computation time by a factor of 2, so the method can be applied even to large datasets like ImageNet within minutes on consumer hardware. It is a lossless preprocessing step. Our evaluation shows improved training and inference performance of transformers on visual data, achieved by compressing frequent constellations of tokens: the resulting sequences are shorter and their information content is more uniformly distributed, e.g. empty regions in an image are condensed into single tokens. As our experiments show, these condensed sequences are easier to process. We additionally introduce a strategy to amplify this compression further by clustering the vocabulary.
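
The merge procedure described in the abstract (count constellations of token pairs, replace the most frequent one with a new token) can be illustrated with a minimal 2D sketch. This is not the authors' implementation: the `EMPTY` placeholder, the restriction to right and down neighbours, the raster-order greedy merging, and all function names (`count_pairs`, `merge_step`, `encode`, `decode`, `flatten`) are assumptions made for illustration. In particular, a merged token here is only paired through its anchor cell, which is a simplification.

```python
from collections import Counter

EMPTY = -1                  # assumed placeholder for a cell absorbed by a merge
OFFSETS = [(0, 1), (1, 0)]  # assumed constellations: right and down neighbour


def count_pairs(grid):
    """Count (token, neighbour token, row offset, col offset) constellations."""
    rows, cols = len(grid), len(grid[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            a = grid[r][c]
            if a == EMPTY:
                continue
            for dr, dc in OFFSETS:
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols and grid[r2][c2] != EMPTY:
                    counts[(a, grid[r2][c2], dr, dc)] += 1
    return counts


def merge_step(grid, new_id):
    """Replace the most frequent constellation in place; return the rule used."""
    counts = count_pairs(grid)
    if not counts:
        return None
    (a, b, dr, dc), _ = counts.most_common(1)[0]
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):          # greedy, raster order
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if r2 < rows and c2 < cols and grid[r][c] == a and grid[r2][c2] == b:
                grid[r][c] = new_id    # anchor cell carries the new token
                grid[r2][c2] = EMPTY   # partner cell is absorbed
    return (new_id, a, b, dr, dc)


def encode(grid, first_new_id, num_merges):
    """Apply up to num_merges merge steps to a copy of the grid."""
    grid = [row[:] for row in grid]
    rules = []
    for i in range(num_merges):
        rule = merge_step(grid, first_new_id + i)
        if rule is None:
            break
        rules.append(rule)
    return grid, rules


def decode(grid, rules):
    """Undo the merges in reverse order, restoring the original grid."""
    grid = [row[:] for row in grid]
    for new_id, a, b, dr, dc in reversed(rules):
        for r in range(len(grid)):
            for c in range(len(grid[0])):
                if grid[r][c] == new_id:
                    grid[r][c] = a
                    grid[r + dr][c + dc] = b
    return grid


def flatten(grid):
    """Raster-order token sequence without absorbed cells."""
    return [t for row in grid for t in row if t != EMPTY]
```

Applied to a small token grid that is mostly a single "empty" token, `encode` followed by `flatten` yields a shorter sequence, and `decode` restores the original grid, which mirrors the lossless property stated above. The sketch recounts all pairs after every merge, so it makes no attempt to reproduce the runtime behaviour reported in the abstract.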

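This work addresses the lack of global content awareness in the tokenisation of visual data by proposing a new method that extends Byte Pair Encoding from one dimension to multiple dimensions. By counting frequent token pairs and replacing them with new tokens, the study shows that the method reduces sequence length and improves the training and inference performance of transformers on visual data. Importantly, this lossless preprocessing step is computationally efficient enough to be applied to large datasets.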

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation