Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable
success in computer vision and particularly demonstrated superior robustness to
distribution shifts of 2D images. However, their robustness under 3D viewpoint
variations is still limited, which can hinder their deployment in real-world
applications. This paper addresses this concern while preserving the original
performance of VLP models by tackling two primary obstacles: 1) the
scarcity of training data and 2) suboptimal fine-tuning paradigms. To
combat data scarcity, we build the Multi-View Caption (MVCap) dataset -- a
comprehensive collection of over four million multi-view image-text pairs
across more than 100K objects, giving VLP models more opportunity to
develop generalizable viewpoint-invariant representations. To address the
limitations of existing paradigms in performance trade-offs and training
efficiency, we design a novel fine-tuning framework named Omniview-Tuning
(OVT). Specifically, OVT introduces a Cross-Viewpoint Alignment objective
through a minimax-like optimization strategy, which effectively aligns
representations of identical objects from diverse viewpoints without causing
overfitting. Additionally, OVT fine-tunes VLP models in a parameter-efficient
manner, leading to minimal computational cost. Extensive experiments on various
VLP models with different architectures validate that OVT significantly
improves the models' resilience to viewpoint shifts while preserving their original
performance, establishing a pioneering standard for boosting the viewpoint
invariance of VLP models.
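
To make the idea of a minimax-like Cross-Viewpoint Alignment objective more concrete, the following is a minimal sketch under our own assumptions, not the exact OVT formulation. It assumes each object has several view embeddings, uses the normalized mean embedding as a hypothetical anchor, and uses cosine distance; the function names are illustrative only. The inner "max" step picks the viewpoint farthest from the anchor for each object, and the outer "min" step would be carried out by a gradient-based optimizer reducing the returned loss.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def worst_case_view_loss(view_embeddings):
    """Illustrative minimax-style alignment loss (an assumption, not the paper's exact loss).

    view_embeddings: array of shape (num_objects, num_views, dim).
    Returns (loss, worst_idx), where worst_idx[i] is the index of the view
    of object i that lies farthest from that object's anchor.
    """
    z = l2_normalize(view_embeddings)
    # Hypothetical anchor: normalized mean of all views of the same object.
    anchor = l2_normalize(z.mean(axis=1, keepdims=True))   # (N, 1, D)
    # Cosine distance of every view to its object's anchor.
    dist = 1.0 - np.sum(z * anchor, axis=-1)               # (N, V)
    # Inner maximization: select the hardest viewpoint per object.
    worst_idx = dist.argmax(axis=1)
    # Outer minimization target: mean worst-case distance over objects.
    loss = dist.max(axis=1).mean()
    return float(loss), worst_idx
```

In an actual fine-tuning run this loss would be written in an autodiff framework and combined with the model's original image-text objective; the NumPy version above only shows how the hardest viewpoint is selected and scored.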

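Through a multi-view training dataset and architectural optimization, this paper improves the robustness of Vision-Language Pre-training (VLP) models under 3D viewpoint variations and strengthens their invariance to viewpoint changes.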