Rotary Position Embedding (RoPE) performs remarkably well on language models,
especially for length extrapolation of Transformers. However, the impact of
RoPE on computer vision has been underexplored, even though RoPE
appears capable of enhancing Vision Transformer (ViT) performance in a way
similar to the language domain. This study provides a comprehensive analysis of
RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D
vision data. The analysis reveals that RoPE demonstrates impressive
extrapolation performance, i.e., maintaining precision while increasing image
resolution at inference. This ultimately leads to performance improvements on
ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study
provides thorough guidelines for applying RoPE to ViT, promising improved
backbone performance with minimal extra computational overhead. Our code and
pre-trained models are available at this https URL

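This work presents a comprehensive analysis of practical RoPE (Rotary Position Embedding) implementations for 2D vision data in Vision Transformers (ViT). The results show that RoPE maintains precision as image resolution increases at inference, improving performance on ImageNet-1k, COCO detection, and ADE-20k segmentation. The study provides detailed guidelines for applying RoPE to ViT, promising improved backbone performance with minimal extra computational overhead.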
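
To make "RoPE for 2D vision data" concrete, the following is a minimal NumPy sketch of one common way to extend RoPE to a patch grid: an axial scheme in which half of the channel pairs are rotated by the patch's column index and the other half by its row index. It is an illustration under stated assumptions, not the released implementation; the function name `axial_rope_2d`, the `base` value, and the channel layout are all hypothetical choices made for this example.

```python
import numpy as np

def axial_rope_2d(x, height, width, base=100.0):
    """Rotate token features by their 2D patch position (illustrative sketch).

    x      : array of shape (height * width, dim), tokens in row-major order
    base   : frequency base; the value here is an assumption for illustration
    """
    n, dim = x.shape
    assert n == height * width, "one token per patch is assumed"
    assert dim % 4 == 0, "dim must split into channel pairs for two axes"

    # One frequency per channel pair; half of the pairs go to each axis.
    n_freq = dim // 4
    freqs = base ** (-np.arange(n_freq) / n_freq)

    # Row (y) and column (x) index of every patch in row-major order.
    ys, xs = np.divmod(np.arange(n), width)

    # Rotation angle for each (token, channel pair): shape (n, dim // 2).
    angles = np.concatenate([np.outer(xs, freqs), np.outer(ys, freqs)], axis=1)
    cos, sin = np.cos(angles), np.sin(angles)

    # Treat (even, odd) channels as 2D vectors and rotate each by its angle.
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

In attention, the same rotation would be applied to queries and keys before the dot product. Because each channel pair undergoes a pure rotation, the query-key score then depends only on the relative offset between two patches, and the angles for a larger grid follow from the same formula without any learned table to resize. This is one plausible reading of why such a scheme can be evaluated at resolutions not seen in training.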