We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective by combining it with several prior, independently developed techniques into a unified recipe: captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants that support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost against performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

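This work addresses the shortcomings of existing vision-language encoders in multilingual semantic understanding and proposes a new unified training recipe that combines several independently developed techniques. The results show that SigLIP 2 outperforms its predecessor in core capabilities such as zero-shot classification, image-text retrieval, and visual representation extraction, and that it also delivers significant improvements on localization and dense prediction tasks.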

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features