Recently, several studies have reported on the fine-tuning of foundation
models for image-text modeling in the field of medicine, utilizing images from
online data sources such as Twitter and PubMed. Foundation models are large,
deep artificial neural networks capable of learning the context of a specific
domain through training on exceptionally extensive datasets. Through
validation, we have observed that the representations generated by such models
exhibit inferior performance in retrieval tasks within digital pathology when
compared to those generated by significantly smaller, conventional deep
networks.

最近，在医学领域中，有几项研究报道了利用像推特和 PubMed 这样的在线数据来源中的图像对基础模型进行微调以进行图像 - 文本建模。基础模型是能够通过在非常广泛的数据集上训练来学习特定领域上下文的大型深度人工神经网络。通过验证，我们观察到，与显著较小的传统深度网络生成的表示相比，这些模型生成的表示在数字病理学的检索任务中表现出较差的性能。

何时称为基础模型的基础模型

When is a Foundation Model a Foundation Model

Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g. image pixels), thus producing semantically
rich multimodal representations that perform well on both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.

本文提出一种联合掩蔽多模态建模方法 (MAMO)，通过联合掩盖图像 - 文本输入，并通过隐式和显式目标来恢复掩蔽信号，从而学习细粒度的多模态表示，实现高级和语义明确的信息恢复，取得了各种下游视觉 - 语言任务中的最新成果。

MAMO: 面向细粒度视觉语言表征学习的遮蔽多模态建模

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

We present a new human-human dialogue dataset - PhotoChat, the first dataset
that casts light on the photo sharing behavior in onlin emessaging. PhotoChat
contains 12k dialogues, each of which is paired with a user photo that is
shared during the conversation. Based on this dataset, we propose two tasks to
facilitate research on image-text modeling: a photo-sharing intent prediction
task that predicts whether one intends to share a photo in the next
conversation turn, and a photo retrieval task that retrieves the most relevant
photo according to the dialogue context. In addition, for both tasks, we
provide baseline models using the state-of-the-art models and report their
benchmark performances. The best image retrieval model achieves 10.4% recall@1
(out of 1000 candidates) and the best photo intent prediction model achieves
58.1% F1 score, indicating that the dataset presents interesting yet
challenging real-world problems. We are releasing PhotoChat to facilitate
future research work among the community.

本研究提出了一个新的人对人对话数据集 - PhotoChat，该数据集是第一个关注于在线消息中照片分享行为的数据集，其中包含 12k 个对话。基于此数据集，我们提出了两个任务以促进图像文本建模的研究：一个是用于预测下一个对话回合中是否打算分享照片的照片分享意图预测任务，另一个是根据对话上下文检索最相关的照片的照片检索任务。在这两个任务中，我们使用最先进的模型提供基线模型，并报告其基准表现。最佳图像检索模型实现了 10.4％的召回率 @ 1（1000 个候选项之一），而最佳照片意图预测模型达到 58.1％的 F1 分数，表明该数据集提供了有趣而具有挑战性的现实问题。我们发布了 PhotoChat，以促进未来的研究工作。