While contrastive language image pretraining (CLIP) has exhibited impressive
performance by learning highly semantic and generalized representations, recent
works have exposed a fundamental drawback in its syntactic properties: it
struggles to interpret fine-grained attributes, actions, spatial relations,
states, and details that require compositional reasoning. One reason for this
is that natural captions often do not capture all the visual details of a
scene. This leads to unaddressed visual concepts being misattributed to the
wrong words. Moreover, the pooled image and text features end up acting as a bag
of words, hence losing syntactic information. In this work, we ask: Is it
possible to enhance CLIP's fine-grained and syntactic abilities without
compromising its semantic properties? We show that this is possible by adapting
CLIP efficiently on a high-quality, comprehensive, and relatively small
dataset. We demonstrate our adaptation strategy on VidSitu, a video situation
recognition dataset annotated with verbs and rich semantic role labels (SRL).
We use the SRL and verb information to create rule-based detailed captions,
making sure they capture most of the visual concepts. Combined with hard
negatives and hierarchical losses, these annotations allow us to learn a
powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that
preserves semantic understanding while being detail-oriented. We evaluate on
five diverse vision-language tasks in both fine-tuning and zero-shot settings,
achieving consistent improvements over the base CLIP model.
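
The caption construction and hard-negative idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the FiGCLIP implementation: the annotation fields, the caption template, the verb-swap negative, and the InfoNCE-style loss are all assumptions, and the hierarchical losses are not shown.

```python
import math

# Hypothetical SRL annotation for one video event; the field names are
# illustrative and are not taken from the VidSitu schema.
event = {"verb": "throws", "agent": "a man", "patient": "a ball", "scene": "in a park"}

def make_caption(ev):
    """Fill a fixed template with the verb and its semantic roles."""
    return f"{ev['agent']} {ev['verb']} {ev['patient']} {ev['scene']}"

def make_hard_negative(ev, wrong_verb):
    """Swap only the verb so the negative differs in one fine-grained detail."""
    return make_caption({**ev, "verb": wrong_verb})

def contrastive_loss(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE-style loss: negative log-softmax of the positive similarity."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

caption = make_caption(event)
negative = make_hard_negative(event, "catches")

# Toy similarity scores standing in for video-text cosine similarities.
loss_easy = contrastive_loss(0.9, [0.1])  # negative is easy to tell apart
loss_hard = contrastive_loss(0.9, [0.8])  # hard negative scores almost as high
```

In this toy setting the hard negative yields the larger loss, which is the intuition for why such negatives push the model toward detail-sensitive representations.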

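Using detail-oriented captions built from the VidSitu dataset together with hierarchical losses, we improve the contrastive language image pretraining (CLIP) model, strengthening its fine-grained and syntactic understanding and achieving consistent improvements across different tasks.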

FiGCLIP: Fine-Grained CLIP Adaptation with Densely Annotated Videos

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

In the era of modern healthcare, swiftly generating medical question
summaries is crucial for informed and timely patient care. Despite the
increasing complexity and volume of medical data, existing studies have focused
solely on text-based summarization, neglecting the integration of visual
information. Recognizing the untapped potential of combining textual queries
with visual representations of medical conditions, we introduce the Multimodal
Medical Question Summarization (MMQS) Dataset. This dataset, a major
contribution of our work, pairs medical queries with visual aids, facilitating
a richer and more nuanced understanding of patient needs. We also propose a
framework that harnesses Contrastive Language Image Pretraining (CLIP), a
multimodal foundation model, and various general-purpose Large Language Models
(LLMs). It comprises four main modules: the medical disorder identification
module, the relevant context generation module, the context filtration module
for distilling relevant medical concepts and knowledge, and finally, a
general-purpose LLM to generate visually aware medical question summaries.
Leveraging our MMQS dataset, we showcase how visual cues from images enhance
the generation of medically nuanced summaries. This multimodal approach not
only enhances the decision-making process in healthcare but also fosters a more
nuanced understanding of patient queries, laying the groundwork for future
research in personalized and responsive medical care.
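
The four-module structure described above could be wired together along the following lines. This is a hedged sketch of the data flow only: the function names, their inputs and outputs, and the stand-in bodies are all hypothetical, and a real system would replace them with a CLIP-based classifier and LLM calls.

```python
def identify_disorder(image):
    """Stand-in for the disorder identification module (e.g. CLIP zero-shot)."""
    return "skin rash"

def generate_context(disorder):
    """Stand-in for an LLM that produces background sentences for a disorder."""
    return [f"{disorder} can cause itching", "unrelated sentence"]

def filter_context(context, disorder):
    """Stand-in filter: keep only sentences that mention the disorder."""
    return [sentence for sentence in context if disorder in sentence]

def write_summary(question, disorder, context):
    """Stand-in for the final LLM that writes a visually aware summary."""
    return f"Patient asks about {disorder}: {question} ({len(context)} context facts)"

def summarize_query(question, image):
    """Chain the four modules in the order the abstract lists them."""
    disorder = identify_disorder(image)
    context = generate_context(disorder)
    kept = filter_context(context, disorder)
    return write_summary(question, disorder, kept)

summary = summarize_query("what cream should I use?", image=None)
```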

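In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. This paper introduces the Multimodal Medical Question Summarization (MMQS) dataset, which pairs medical queries with visual aids to support a richer and more nuanced understanding of patient needs. We propose a framework based on Contrastive Language Image Pretraining (CLIP) and Large Language Models (LLMs), consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries. Using our MMQS dataset, we show how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only improves the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care.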

Multimodal Question Summarization in Healthcare with CLIP and LLMs

CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare

Contrastive language image pretraining (CLIP) is a standard method for
training vision-language models. While CLIP is scalable, promptable, and robust
to distribution shifts on image classification tasks, it lacks object
localization capabilities. This paper studies the following question: Can we
augment CLIP training with task-specific vision models from model zoos to
improve its visual representations? Towards this end, we leverage open-source
task-specific vision models to generate pseudo-labels for an uncurated and
noisy image-text dataset. Subsequently, we train CLIP models on these
pseudo-labels in addition to the contrastive training on image and text pairs.
This simple setup shows substantial improvements of up to 16.3% across
different vision tasks, including segmentation, detection, depth estimation,
and surface normal estimation. Importantly, these enhancements are achieved
without compromising CLIP's existing capabilities, including its proficiency in
promptable zero-shot classification.
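
One way to read "training on pseudo-labels in addition to the contrastive training" is as a weighted sum of the contrastive loss and per-task losses. The sketch below illustrates that reading with toy numbers; the choice of per-task losses, the task names, and the weights are assumptions and not the paper's exact formulation.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def l1_loss(pred, target):
    """Mean absolute error, a common choice for dense regression such as depth."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(contrastive, task_losses, task_weights):
    """Contrastive loss plus a weighted sum of pseudo-label task losses."""
    return contrastive + sum(task_weights[name] * value
                             for name, value in task_losses.items())

# Toy values: row 0 of an image-text similarity matrix, with the match at index 0.
contrastive = cross_entropy([5.0, 1.0, 0.5], 0)

# Pseudo-labels would come from task-specific expert models; here they are toy targets.
task_losses = {
    "segmentation": cross_entropy([2.0, 0.1], 0),
    "depth": l1_loss([0.4, 0.9], [0.5, 1.0]),
}
task_weights = {"segmentation": 1.0, "depth": 1.0}  # hypothetical weights

loss = total_loss(contrastive, task_losses, task_weights)
```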

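By incorporating task-specific vision models into CLIP training and using their pseudo-labels to improve its visual representations, this simple setup substantially improves performance on different vision tasks without harming existing capabilities.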

CLIP Combined with Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Text-based Person Search (TBPS) aims to retrieve person images using
natural language descriptions. Recently, Contrastive Language Image Pretraining
(CLIP), a universal large cross-modal vision-language pre-training model, has
performed remarkably well on various cross-modal downstream tasks due to its
powerful cross-modal semantic learning capacity. TBPS, as a fine-grained
cross-modal retrieval task, has likewise seen a rise in CLIP-based
research. In order to explore the potential of the visual-language
pre-training model for downstream TBPS tasks, this paper makes the first
attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus
contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the
TBPS community. We revisit critical design considerations under CLIP, including
data augmentation and loss function. The model, with the aforementioned designs
and practical training tricks, can attain satisfactory performance without any
sophisticated modules. We also conduct probing experiments on TBPS-CLIP in
model generalization and model compression, demonstrating the effectiveness of
TBPS-CLIP from various aspects. This work is expected to provide empirical
insights and highlight future CLIP-based TBPS research.
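
At inference time, text-based retrieval of this kind is typically a nearest-neighbor search in a shared embedding space. The sketch below ranks a toy gallery by cosine similarity to a query embedding; the vectors and image identifiers are made up, and in practice the embeddings would come from the CLIP text and image encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_gallery(text_embedding, gallery):
    """Return image ids sorted from most to least similar to the text query."""
    scored = [(cosine(text_embedding, emb), image_id)
              for image_id, emb in gallery.items()]
    return [image_id for _, image_id in sorted(scored, reverse=True)]

# Toy embeddings standing in for encoder outputs.
gallery = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.9, 0.2],
    "img_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.2, 0.0]  # embedding of a hypothetical text description

ranking = rank_gallery(query, gallery)
```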

A study of TBPS model design based on Contrastive Language Image Pretraining, providing a comprehensive empirical study of CLIP-based TBPS and a strong TBPS-CLIP baseline model.