The quality of the data and annotation upper-bounds the quality of a
downstream model. While there exist large text corpora and image-text pairs,
high-quality video-text data is much harder to collect. First of all, manual
labeling is more time-consuming, as it requires an annotator to watch an entire
video. Second, videos have a temporal dimension, consisting of several scenes
stacked together, and showing multiple actions. Accordingly, to establish a
video dataset with high-quality captions, we propose an automatic approach
leveraging multimodal inputs, such as textual video description, subtitles, and
individual video frames. Specifically, we curate 3.8M high-resolution videos
from the publicly available HD-VILA-100M dataset. We then split them into
semantically consistent video clips, and apply multiple cross-modality teacher
models to obtain captions for each video. Next, we finetune a retrieval model
on a small subset where the best caption of each video is manually selected and
then employ the model in the whole dataset to select the best caption as the
annotation. In this way, we get 70M videos paired with high-quality text
captions. We dub the dataset as Panda-70M. We show the value of the proposed
dataset on three downstream tasks: video captioning, video and text retrieval,
and text-driven video generation. The models trained on the proposed data score
substantially better on the majority of metrics across all the tasks.

通过多模态输入构建高质量视频数据集，使用检索模型选择最佳字幕注释，名为 Panda-70M，训练模型在视频字幕生成、视频与文本检索等任务上具有优异性能。

Panda-70M：使用多个跨模态教师为 70M 视频加上字幕

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

The recent advance in vision-language models is largely attributed to the
abundance of image-text data. We aim to replicate this success for
video-language models, but there simply is not enough human-curated video-text
data available. We thus resort to fine-tuning a video-language model from a
strong image-language baseline with synthesized instructional data. The
resulting video-language model is then used to auto-label millions of videos to
generate high-quality captions. We show the adapted video-language model
performs well on a wide range of video-language benchmarks. For instance, it
surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our
model generates detailed descriptions for previously unseen videos, which
provide better textual supervision than existing methods. Experiments show that
a video-language dual-encoder model contrastively trained on these
auto-generated captions is 3.8% better than the strongest baseline that also
leverages vision-language models. Our best model outperforms state-of-the-art
methods on MSR-VTT zero-shot text-to-video retrieval by 6%.

本研究利用合成的教学数据对图像语言基准进行微调，生成高质量的视频标题，构建适应视频和语言的模型，并在多个视频 - 语言基准上取得了显著结果。

数百万视频上的视觉语言模型蒸馏

Distilling Vision-Language Models on Millions of Videos

In the realm of large multi-modal models (LMMs), efficient modality alignment
is crucial yet often constrained by the scarcity of high-quality image-text
data. To address this bottleneck, we introduce the ShareGPT4V dataset, a
pioneering large-scale resource featuring 1.2 million highly descriptive
captions, which surpasses existing datasets in diversity and information
content, covering world knowledge, object properties, spatial relationships,
and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated
100K high-quality captions collected from advanced GPT4-Vision and has been
expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V
first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT)
phase, by substituting an equivalent quantity of detailed captions in existing
SFT datasets with a subset of our high-quality captions, significantly
enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME
and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and
2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training
and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple
architecture that has remarkable performance across a majority of the
multi-modal benchmarks. This project is available at
this https URL to serve as a pivotal resource for advancing the
LMMs community.

在大型多模态模型领域，高效的模态对齐对于提升模型性能至关重要，但由于高质量图文数据的稀缺性而受限。为了解决这一瓶颈，我们介绍了 ShareGPT4V 数据集，这是一个包含 120 万条高度描述性的标题的创新大规模资源，其在多样性和信息内容上超越了现有数据集，涵盖了世界知识、对象属性、空间关系和美学评估。具体来说，ShareGPT4V 源于 Advanced GPT4-Vision 收集的 10 万个高质量标题，通过在该子集上进行训练，将其扩展到 120 万个。ShareGPT4V 首先在监督微调（SFT）阶段展示了其有效性，通过用高质量标题子集替换现有 SFT 数据集中等量的详细标题，显著提升了 MME 和 MMBench 基准测试中的 LLaVA-7B、LLaVA-1.5-13B 和 Qwen-VL-Chat-7B 等 LMMs 模型，分别增益为 222.8/22.0/22.3 和 2.7/1.3/1.5。我们进一步将 ShareGPT4V 数据集融入到预训练和 SFT 阶段，获得了 ShareGPT4V-7B，一个基于简单架构的优秀 LMM 模型，其在大多数多模态基准测试上表现出色。该项目可通过此 https 链接获得，以服务于 LMMs 社区的进一步发展。