Although no specific domain knowledge is considered in the design, plain
vision transformers have shown excellent performance in visual recognition
tasks. However, little effort has been made to reveal the potential of such
simple structures for pose estimation tasks. In this paper, we show the
surprisingly good capabilities of plain vision transformers for pose estimation
from various aspects, namely simplicity in model structure, scalability in
model size, flexibility in training paradigm, and transferability of knowledge
between models, through a simple baseline model called ViTPose. Specifically,
ViTPose employs plain and non-hierarchical vision transformers as backbones to
extract features for a given person instance and a lightweight decoder for pose
estimation. It can be scaled up from 100M to 1B parameters by taking advantage
of the scalable model capacity and high parallelism of transformers, setting a
new Pareto front between throughput and performance. ViTPose is also very
flexible with respect to attention type, input resolution, pre-training and
fine-tuning strategy, and the handling of multiple pose tasks. We also
empirically demonstrate that the knowledge of large ViTPose models can be
easily transferred to small ones via a simple knowledge token. Experimental
results show that our basic ViTPose model outperforms representative methods on
the challenging MS COCO Keypoint Detection benchmark, while the largest model
sets a new state-of-the-art. The code and models are available at
this https URL
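
As a rough illustration of the backbone-plus-decoder pipeline described above, the sketch below only tracks tensor shapes: an image crop of one person is split into non-overlapping patches, the plain transformer keeps that token grid at a single resolution (non-hierarchical), and a lightweight decoder upsamples it into one heatmap per keypoint. The function name `vitpose_shapes` and all default values (256x192 input, patch size 16, 4x upsampling, 17 keypoints) are assumptions chosen for illustration; the abstract does not state them.

```python
def vitpose_shapes(img_h=256, img_w=192, patch=16, upsample=4, num_keypoints=17):
    """Shape bookkeeping for a plain ViT backbone followed by a light decoder.

    All defaults are illustrative assumptions, not values from the abstract.
    """
    if img_h % patch or img_w % patch:
        raise ValueError("image size must be divisible by the patch size")

    # Non-overlapping patch embedding: one token per patch.
    grid_h, grid_w = img_h // patch, img_w // patch

    # A plain (non-hierarchical) transformer keeps the same grid in every layer,
    # so the decoder sees a feature map of size grid_h x grid_w.
    return {
        "token_grid": (grid_h, grid_w),
        "num_tokens": grid_h * grid_w,
        # The decoder upsamples the grid and emits one heatmap per keypoint.
        "heatmaps": (num_keypoints, grid_h * upsample, grid_w * upsample),
    }


if __name__ == "__main__":
    print(vitpose_shapes())
```

Because the token grid never changes inside the backbone, scaling the model under these assumptions only changes the depth and width of the transformer, not the shapes the decoder consumes.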

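This paper demonstrates the potential of plain vision transformers for pose estimation through a baseline model called ViTPose. The model is structurally simple, scalable, and flexible in its training paradigm, and it achieves strong performance on keypoint detection, with the largest model reaching state-of-the-art accuracy.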

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Video understanding relies on perceiving the global content and modeling its
internal connections (e.g., causality, movement, and spatio-temporal
correspondence). To learn these interactions, we apply a mask-then-predict
pre-training task on discretized video tokens generated via VQ-VAE. Unlike
language, where the text tokens are more independent, neighboring video tokens
typically have strong correlations (e.g., consecutive video frames usually look
very similar), and hence uniformly masking individual tokens makes the task
too trivial for learning useful representations. To deal with this issue, we
propose a block-wise masking strategy where we mask neighboring video tokens in
both spatial and temporal domains. We also add an augmentation-free contrastive
learning method to further capture the global content by predicting whether the
video clips are sampled from the same video. We pre-train our model on
uncurated videos and show that our pre-trained model can reach state-of-the-art
results on several video understanding datasets (e.g., SSV2, Diving48). Lastly,
we provide detailed analyses of model scalability and pre-training method
design. Code is released at this https URL
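
The block-wise masking idea described above can be sketched as follows: instead of masking tokens independently, contiguous blocks spanning the temporal and spatial axes of the token grid are masked until a target ratio is reached. This is a minimal sketch under stated assumptions, not the authors' implementation; the function name `blockwise_mask`, the block-size limits, the mask ratio, and the sampling scheme are all hypothetical.

```python
import random


def blockwise_mask(t, h, w, mask_ratio=0.5, max_block=(2, 4, 4), seed=0):
    """Return a t x h x w nested list of booleans; True marks a masked token.

    Blocks of neighboring tokens are sampled until at least mask_ratio of the
    grid is masked. All parameter values are illustrative assumptions.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")

    rng = random.Random(seed)
    mask = [[[False] * w for _ in range(h)] for _ in range(t)]
    target = int(mask_ratio * t * h * w)
    masked = 0

    while masked < target:
        # Sample the block extent along time, height, and width.
        bt = rng.randint(1, min(max_block[0], t))
        bh = rng.randint(1, min(max_block[1], h))
        bw = rng.randint(1, min(max_block[2], w))
        # Sample the block origin so the block stays inside the grid.
        t0 = rng.randint(0, t - bt)
        h0 = rng.randint(0, h - bh)
        w0 = rng.randint(0, w - bw)
        for i in range(t0, t0 + bt):
            for j in range(h0, h0 + bh):
                for k in range(w0, w0 + bw):
                    if not mask[i][j][k]:
                        mask[i][j][k] = True
                        masked += 1
    return mask


if __name__ == "__main__":
    m = blockwise_mask(4, 8, 8)
    total = sum(cell for frame in m for row in frame for cell in row)
    print(f"masked {total} of {4 * 8 * 8} tokens")
```

Because whole blocks are added at a time, the final number of masked tokens can slightly exceed the target. Masking neighbors together removes the shortcut of copying a token from an almost identical adjacent position, which is the failure mode of uniform masking that the abstract points out.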

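Building on a mask-then-predict task with a block-wise masking strategy, this work proposes an input-processing strategy and an augmentation-free method that achieve state-of-the-art results on video understanding datasets such as SSV2 and Diving48, and it provides detailed analyses of model scalability and pre-training method design.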