BriefGPT.xyz
Jan, 2024
Distilling Vision-Language Models on Millions of Videos
Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu...
TL;DR
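This work fine-tunes an image-language baseline on synthesized instructional data to obtain a model adapted to video and language, uses it to generate high-quality video captions, and achieves strong results on multiple video-language benchmarks.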
Abstract
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated vid