Large multimodal models trained on natural documents, which interleave images
and text, outperform models trained on image-text pairs on various multimodal
benchmarks that require reasoning over one or multiple images to generate
text. However, the datasets used to train these models have not been released,
and the collection process has not been fully specified. We introduce the
OBELISC dataset, an open web-scale filtered dataset of interleaved image-text
documents comprising 141 million web pages extracted from Common Crawl, 353
million associated images, and 115 billion text tokens. We describe the dataset
creation process, present comprehensive filtering rules, and provide an
analysis of the dataset's content. To show the viability of OBELISC, we train
an 80-billion-parameter vision and language model on the dataset and obtain
competitive performance on various multimodal benchmarks. We release the code
to reproduce the dataset along with the dataset itself.

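This paper introduces OBELISC, a dataset for training large multimodal models, comprising 141 million web pages, 353 million associated images, and 115 billion text tokens; a model trained on it achieves competitive performance on various multimodal benchmarks.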

OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

In this paper we propose a new approach to person re-identification using
images and natural language descriptions. We propose a joint vision and
language model based on CCA and CNN architectures to match across the two
modalities as well as to enrich visual examples for which there are no language
descriptions. We also introduce new annotations in the form of natural language
descriptions for two standard Re-ID benchmarks, namely CUHK03 and VIPeR. We
perform experiments on these two datasets with techniques based on CNN,
hand-crafted features, and LSTM for analysing visual and natural language
description data. We investigate and demonstrate the advantages of using
natural language descriptions compared to attributes as well as CNN compared to
LSTM in the context of Re-ID. We show that the joint use of language and vision
can significantly improve the state-of-the-art performance on standard Re-ID
benchmarks.

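We propose a new approach to person re-identification that uses a joint vision and language model over images and natural language descriptions. Using natural language descriptions rather than attributes, and CNN rather than LSTM, significantly improves performance on standard Re-ID benchmarks.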