Contrastive learning has emerged as a transformative method for learning
effective visual representations through the alignment of image and text
embeddings. However, pairwise similarity computation in contrastive loss
between image and text pairs poses computational challenges. This paper
presents a novel method for weakly supervised pre-training of vision models on web-scale
image-text data. The proposed method reframes pre-training on image-text data
as a classification task. Consequently, it eliminates the need for pairwise
similarity computations in contrastive loss, achieving a remarkable $2.7\times$
acceleration in training speed compared to contrastive learning on web-scale
data. Through extensive experiments spanning diverse vision tasks, including
detection and segmentation, we demonstrate that the proposed method maintains
high representation quality. Our source code along with pre-trained model
weights and training recipes is available at
https://github.com/apple/corenet.
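
To make the cost difference concrete, the following is a minimal sketch contrasting the two objectives. It is not the authors' implementation: the abstract does not say how labels are derived from captions or which classification loss is used, so the word-matching helper `captions_to_targets`, the toy vocabulary, and the choice of multi-label binary cross-entropy are all assumptions made for illustration.

```python
import math


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; needs a B x B similarity matrix."""
    b = len(image_emb)
    # Pairwise similarities: every image against every text (B * B dot products).
    sim = [[sum(x * y for x, y in zip(image_emb[i], text_emb[j])) / temperature
            for j in range(b)] for i in range(b)]

    def row_ce(matrix):
        # Cross-entropy where the matching pair sits on the diagonal.
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(matrix)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (row_ce(sim) + row_ce(sim_t))


def captions_to_targets(captions, vocab):
    """Hypothetical label extraction: mark vocabulary words present in a caption."""
    targets = []
    for caption in captions:
        words = set(caption.lower().split())
        targets.append([1.0 if w in words else 0.0 for w in vocab])
    return targets


def classification_loss(logits, targets):
    """Multi-label binary cross-entropy (an assumed loss); no pairwise term."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, t in zip(row_logits, row_targets):
            # Numerically stable BCE on logits.
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy usage: two captions, a three-word vocabulary, made-up image logits.
vocab = ["dog", "cat", "beach"]
targets = captions_to_targets(["A dog on the beach", "a sleeping cat"], vocab)
loss = classification_loss([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0]], targets)
```

In this sketch the contrastive loss computes a similarity for every image-text combination in the batch and requires a text encoder to produce `text_emb`, whereas the classification loss only compares each image's logits against its own target vector.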

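Through weakly supervised pre-training on web-scale image-text data, this paper proposes a method that eliminates the need for pairwise image-text similarity computation in the contrastive loss, achieving a 2.7x speedup in training. Extensive experiments show that the method maintains high representation quality across a variety of vision tasks.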

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Recently, Vision-Language Pre-training (VLP) techniques have greatly
benefited various vision-language tasks by jointly learning visual and textual
representations, which intuitively helps in Optical Character Recognition (OCR)
tasks due to the rich visual and textual information in scene text images.
However, these methods do not cope well with OCR tasks because of the
difficulty in both instance-level text encoding and image-text pair acquisition
(i.e., images and the texts captured in them). This paper presents a weakly
supervised pre-training method, oCLIP, which can acquire effective scene text
representations by jointly learning and aligning visual and textual
information. Our network consists of an image encoder and a character-aware
text encoder that extract visual and textual features, respectively, as well as
a visual-textual decoder that models the interaction among textual and visual
features for learning effective scene text representations. By learning
textual features, the pre-trained model can attend to texts in images with
character awareness. Moreover, these designs enable learning from weakly
annotated texts (i.e., partial texts in images without text bounding boxes),
which greatly mitigates the data annotation constraint. Experiments on the
weakly annotated images in ICDAR2019-LSVT show that our pre-trained model
improves F-score by +2.5\% and +4.8\% when its weights are transferred to other
text detection and spotting networks, respectively. In addition, the proposed
method outperforms existing pre-training techniques consistently across
multiple public datasets (e.g., +3.2\% and +1.3\% for Total-Text and CTW1500).
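
The interaction between character-level text features and image features can be illustrated with a toy sketch. The abstract does not describe the internals of the encoders or the decoder, so everything here is an assumption: `char_embeddings` is a hypothetical stand-in for a character-aware text encoder, and single-head cross-attention is one plausible way a visual-textual decoder might let each character query image regions.

```python
import math


def char_embeddings(text, dim=4):
    """Hypothetical character-aware encoding: one vector per character and position."""
    return [[math.sin((ord(c) + 1) * (k + 1) * 0.1 + pos) for k in range(dim)]
            for pos, c in enumerate(text)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, patches):
    """Each character query attends over all image patch features (assumed design)."""
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product scores between one character and every patch.
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        w = softmax(scores)
        # Weighted sum of patch features: the visual context for this character.
        out = [sum(w[j] * patches[j][d] for j in range(len(patches)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy usage: a partial transcription and three made-up image patch features.
queries = char_embeddings("EXIT")
patches = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.4, 0.1, 0.0]]
fused, attn = cross_attention(queries, patches)
```

Because the queries come from whatever text is available, a sketch like this needs only a transcription and no bounding boxes, which matches the weak-annotation setting the abstract describes.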

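This paper proposes oCLIP, a weakly supervised pre-training method that acquires effective scene text representations by jointly learning visual and textual information, can learn from weakly annotated texts, and copes effectively with OCR tasks. Experiments show that it outperforms existing pre-training techniques on multiple public datasets.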