Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. Both methods incur significant losses in the information and context present in an image. In many downstream applications, such as real-world satellite imagery, global context matters as much as high-frequency details; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce xT, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales, and we assess our method's improvement on them. By introducing a nested tokenization scheme for large images in conjunction with long-sequence length models normally used for natural language processing, we are able to increase accuracy by up to 8.6% on challenging classification tasks and $F_1$ score by 11.6 on context-dependent segmentation in large images.

xT: Nested Tokenization for Larger Context in Large Images