TL;DR: Using detail-oriented captions built on the VidSitu dataset together with a hierarchical loss, we improve the contrastive language-image pretraining (CLIP) model, enhancing its fine-grained and syntactic understanding and yielding consistent gains across a range of tasks.
Abstract
While contrastive language-image pretraining (CLIP) has exhibited impressive performance by learning highly semantic and generalized representations, recent works have exposed a fundamental drawback in its syntactic proficiency.