Cross-modal contrastive learning in vision language pretraining (VLP) faces
the challenge of (partial) false negatives. In this paper, we study this
problem from the perspective of Mutual Information (MI) optimization. It is
well known that the InfoNCE loss used in contrastive learning maximizes a
lower bound on the MI between anchors and their positives; we prove
theoretically that the MI involving negatives also matters when noise is common.
Guided by a more general lower bound form for optimization, we propose a
contrastive learning strategy regulated by progressively refined cross-modal
similarity, to more accurately optimize MI between an image/text anchor and its
negative texts/images instead of improperly minimizing it. Our method performs
competitively on four downstream cross-modal tasks and systematically balances
the beneficial and harmful effects of (partial) false negative samples under
theoretical guidance.
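
As a rough illustration of the standard objective the abstract starts from, the sketch below computes a symmetric InfoNCE loss over a batch of paired image/text embeddings and the corresponding MI lower bound, log N minus the loss. It is a minimal sketch and not the method proposed in the paper: the function names, the temperature value, and the use of NumPy are assumptions made for illustration. Every off-diagonal pair is treated as a hard negative, which is exactly the setting in which (partial) false negatives become a problem.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of N aligned image/text pairs.

    Row i of image_emb and row i of text_emb form the positive pair;
    all other pairs in the batch are treated as negatives.
    The temperature default is an illustrative assumption.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)

    # Image-to-text: softmax over texts for each image (rows).
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    # Text-to-image: softmax over images for each text (columns).
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)


def mi_lower_bound(loss, batch_size):
    """InfoNCE bound on MI: I(anchor; positive) >= log(N) - loss."""
    return np.log(batch_size) - loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.1 * rng.normal(size=(8, 16))  # noisy aligned pairs
    loss = info_nce(images, texts)
    print("loss:", loss, "MI bound:", mi_lower_bound(loss, 8))
```

Because the loss is non-negative, the bound can never exceed log N, and it drops to zero when the embeddings carry no information about the pairing.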

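From the perspective of mutual information optimization, this paper studies how negative samples affect cross-modal contrastive learning in vision language pretraining, and proposes a contrastive learning strategy regulated by progressively refined cross-modal similarity. Under theoretical guidance, the method balances the beneficial and harmful effects of (partial) false negative samples, and it performs well on four downstream cross-modal tasks.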

Exploiting Pseudo Image Captions for Multimodal Summarization

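From the perspective of Mutual Information (MI) optimization, this paper studies the challenge of (partial) false negatives faced by vision language pretraining (VLP), and proposes a contrastive learning strategy constrained by progressively refined cross-modal similarity to more accurately optimize the MI between an image/text anchor and its negatives. The method is competitive on four downstream cross-modal tasks and balances the beneficial and harmful effects of (partial) false negatives.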