Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy
web-crawled datasets. This underscores the critical need for dataset pruning,
as the quality of these datasets is strongly correlated with the performance of
VLMs on downstream tasks. One of the most successful pruning methods uses
CLIPScore from a pretrained model to train only on highly aligned samples. We
argue that this approach suffers from multiple limitations
including: 1) false positives due to spurious correlations captured by the
pretrained CLIP model, 2) false negatives due to poor discrimination between
hard and bad samples, and 3) biased ranking towards samples similar to the
pretrained CLIP dataset. We propose a pruning method, SIEVE, that employs
synthetic captions generated by image-captioning models pretrained on small,
diverse, and well-aligned image-text pairs to evaluate the alignment of noisy
image-text pairs. To bridge the gap between the limited diversity of generated
captions and the high diversity of alternative text (alt-text), we estimate the
semantic textual similarity in the embedding space of a language model
pretrained on billions of sentences. Using DataComp, a multimodal dataset
filtering benchmark, we achieve state-of-the-art performance on the large scale
pool, and competitive results on the medium scale pool, surpassing
CLIPScore-based filtering by 1.7% and 2.6% on average across 38 downstream tasks.
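
The scoring idea described above (compare each alt-text with synthetic captions in a sentence-embedding space, then keep the best-aligned pairs) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `embed` is a toy bag-of-words stand-in for the pretrained language model, the captions are hard-coded rather than produced by a captioning model, and taking the maximum over several captions and keeping a fixed fraction are choices of this sketch, not details stated in the abstract.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: lowercase bag-of-words counts.
    (Assumption: a real pipeline would use a pretrained sentence embedding model.)"""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def alignment_score(alt_text, generated_captions):
    """Score an image-text pair by the best match between its alt-text and
    any synthetic caption of the image (max over captions is an assumption)."""
    alt = embed(alt_text)
    return max((cosine(alt, embed(c)) for c in generated_captions), default=0.0)


def prune(samples, keep_fraction):
    """Keep the top `keep_fraction` of samples ranked by alignment score."""
    ranked = sorted(
        samples,
        key=lambda s: alignment_score(s["alt_text"], s["captions"]),
        reverse=True,
    )
    keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical samples: "captions" would come from an image-captioning model.
samples = [
    {
        "id": "a",
        "alt_text": "a brown dog running on the beach",
        "captions": ["a dog runs along the beach"],
    },
    {
        "id": "b",
        "alt_text": "IMG_0042.jpg click here to buy",
        "captions": ["a red car parked on a street"],
    },
]
kept = prune(samples, keep_fraction=0.5)
print([s["id"] for s in kept])
```

Because the score is computed between two pieces of text, the pretrained CLIP model never ranks the pair directly; in this sketch the uninformative alt-text of sample "b" shares no tokens with its caption and is dropped.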

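Pruning datasets with a pretrained image-text model is a successful approach, but it has several limitations. To address them, we propose a method called SIEVE, which uses synthetic captions generated by a pretrained image-captioning model to evaluate the alignment of image-text pairs; it achieves state-of-the-art performance on the large scale pool and competitive results on the medium scale pool.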