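TL;DR: This paper introduces Knight, a zero-shot image and video captioning method based on K-nearest-neighbor cross-modal mapping. Trained on text only, without supervision, it achieves state-of-the-art zero-shot performance on image and video captioning.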
Abstract
With the development of vision-language pre-training models (VLPMs)
represented by CLIP and ALIGN, significant breakthroughs have been achieved for
association-based visual tasks such as image classification and