Multimodal alignment between language and vision is the fundamental topic in
current vision-language model research. Contrastive Captioners (CoCa), as a
representative method, integrates Contrastive Language-Image Pretraining (CLIP)
and Image Caption (IC) into a unified framework, resulting in impressive
results. CLIP imposes a bidirectional constraints on global representation of
entire images and sentences. Although IC conducts an unidirectional
image-to-text generation on local representation, it lacks any constraint on
local text-to-image reconstruction, which limits the ability to understand
images at a fine-grained level when aligned with texts. To achieve multimodal
alignment from both global and local perspectives, this paper proposes
Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional
interactions on images and texts across the global and local representation
levels. Specifically, we expand a Text-Guided Masked Image Modeling (TG-MIM)
head based on ITC and IC heads. The improved SyCoCa can further leverage
textual cues to reconstruct contextual images and visual cues to predict
textual contents. When implementing bidirectional local interactions, the local
contents of images tend to be cluttered or unrelated to their textual
descriptions. Thus, we employ an attentive masking strategy to select effective
image patches for interaction. Extensive experiments on five vision-language
tasks, including image-text retrieval, image-captioning, visual question
answering, and zero-shot/finetuned image classification, validate the
effectiveness of our proposed method.

当前视觉语言模型研究的基础主题是语言和视觉之间的多模态对齐。对比式字幕生成器 (CoCa) 是一种代表性方法，它将对比语言 - 图像预训练 (CLIP) 和图像字幕 (IC) 集成到统一框架中，取得了令人印象深刻的结果。本文提出了一种称为对称式对比式字幕生成器 (SyCoCa) 的方法，从全局和局部表达水平上引入图像和文本交互，以实现多模态对齐。在实验中，我们验证了所提出方法的有效性。

SyCoCa: 对称化的关注屏蔽对齐的对比式字幕生成器

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for  Multimodal Alignment

We propose a novel zero-shot approach to computing correspondences between 3D
shapes. Existing approaches mainly focus on isometric and near-isometric shape
pairs (e.g., human vs. human), but less attention has been given to strongly
non-isometric and inter-class shape matching (e.g., human vs. cow). To this
end, we introduce a fully automatic method that exploits the exceptional
reasoning capabilities of recent foundation models in language and vision to
tackle difficult shape correspondence problems. Our approach comprises multiple
stages. First, we classify the 3D shapes in a zero-shot manner by feeding
rendered shape views to a language-vision model (e.g., BLIP2) to generate a
list of class proposals per shape. These proposals are unified into a single
class per shape by employing the reasoning capabilities of ChatGPT. Second, we
attempt to segment the two shapes in a zero-shot manner, but in contrast to the
co-segmentation problem, we do not require a mutual set of semantic regions.
Instead, we propose to exploit the in-context learning capabilities of ChatGPT
to generate two different sets of semantic regions for each shape and a
semantic mapping between them. This enables our approach to match strongly
non-isometric shapes with significant differences in geometric structure.
Finally, we employ the generated semantic mapping to produce coarse
correspondences that can further be refined by the functional maps framework to
produce dense point-to-point maps. Our approach, despite its simplicity,
produces highly plausible results in a zero-shot manner, especially between
strongly non-isometric shapes.

本文提出一种新颖的零样本方法，用于计算 3D 模型之间的对应关系，特别是针对具有很强差异性和不同类别之间匹配的问题，并在零样本情况下使用 language-vision model 方法进行分类，使用 ChatGPT 生成语义映射，并使用 functional maps framework 进一步细化到点对点映射。