Despite the promising performance of current 3D human pose estimation techniques, understanding and enhancing their generalization on challenging in-the-wild videos remain an open problem. In this work, we focus on the robustness of 2D-to-3D pose lifters. To this end, we develop two benchmark datasets, namely Human3.6M-C and HumanEva-I-C, to examine the robustness of video-based 3D pose lifters to a wide range of common video corruptions including temporary occlusion, motion blur, and pixel-level noise. We observe the poor generalization of state-of-the-art 3D pose lifters in the presence of corruption and establish two techniques to tackle this issue. First, we introduce Temporal Additive Gaussian Noise (TAGN) as a simple yet effective 2D input pose data augmentation. Additionally, to incorporate the confidence scores output by the 2D pose detectors, we design a confidence-aware convolution (CA-Conv) block. Extensively tested on corrupted videos, the proposed strategies consistently boost the robustness of 3D pose lifters and serve as new baselines for future research.
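The abstract does not say how TAGN is parameterized, so the following is only a minimal sketch of additive Gaussian noise applied over time to a 2D pose sequence. It assumes the noise is zero-mean, drawn independently for every joint coordinate in every frame, and applied to normalized coordinates. The function name `add_temporal_gaussian_noise`, the `sigma` default, and the list-of-frames layout are hypothetical choices made for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

# One frame of a 2D pose: an (x, y) pair per joint (assumed layout).
Pose2D = List[Tuple[float, float]]


def add_temporal_gaussian_noise(
    sequence: List[Pose2D],
    sigma: float = 0.02,          # hypothetical noise scale, not from the paper
    seed: Optional[int] = None,
) -> List[Pose2D]:
    """Return a noisy copy of a 2D pose sequence.

    Zero-mean Gaussian noise with standard deviation `sigma` is added
    independently to each coordinate of each joint in each frame.
    The input sequence is left unchanged.
    """
    rng = random.Random(seed)
    noisy_sequence = []
    for frame in sequence:
        noisy_frame = [
            (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in frame
        ]
        noisy_sequence.append(noisy_frame)
    return noisy_sequence


if __name__ == "__main__":
    # Three frames, two joints each, in normalized image coordinates.
    clean = [[(0.10, 0.20), (0.30, 0.40)] for _ in range(3)]
    augmented = add_temporal_gaussian_noise(clean, sigma=0.02, seed=0)
    for frame in augmented:
        print(frame)
```

In a training loop, such a function would be applied to the 2D keypoint input only, with the 3D target left untouched, so that the lifter learns to tolerate detector jitter of the kind video corruption produces.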

Improving the Robustness of 3D Human Pose Estimation: A Benchmark and Learning from Noisy Input