Vision-language models have emerged as a powerful tool for previously
challenging multi-modal classification problems in the medical domain. This
development has led to the exploration of automated image description
generation for multi-modal clinical scans, particularly for radiology report
generation. Existing research has focused on clinical descriptions for specific
modalities or body regions, leaving a gap for a model providing entire-body
multi-modal descriptions. In this paper, we address this gap by automating the
generation of standardized body station(s) and list of organ(s) across the
whole body in multi-modal MR and CT radiological images. Leveraging the
versatility of Contrastive Language-Image Pre-training (CLIP), we refine
and augment the existing approach through multiple experiments, including
baseline model fine-tuning, adding station(s) as a superset for better
correlation between organs, along with image and language augmentations. Our
proposed approach demonstrates a 47.6% performance improvement over the
baseline PubMedCLIP.
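
As a rough illustration of how CLIP-style label scoring can combine several
text phrasings per label, the sketch below averages the embeddings of multiple
prompt variants into one prototype per body station and scores an image against
each prototype by cosine similarity. The abstract does not describe the
augmentation procedure, so the averaging step, the function names, the
temperature value, and the toy 3-d embeddings are all assumptions made for
illustration, not the authors' method.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def label_prototype(prompt_embeddings):
    """Average the unit-normalized embeddings of several prompt phrasings."""
    normed = [normalize(e) for e in prompt_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return normalize(mean)

def label_probabilities(image_embedding, prototypes, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (temperature is assumed)."""
    image = normalize(image_embedding)
    logits = {
        label: sum(a * b for a, b in zip(image, proto)) / temperature
        for label, proto in prototypes.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}

# Toy 3-d vectors standing in for text and image encoder outputs (hypothetical).
prompt_embeddings = {
    "chest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "abdomen": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
prototypes = {label: label_prototype(e) for label, e in prompt_embeddings.items()}
scores = label_probabilities([0.85, 0.15, 0.05], prototypes)
best_label = max(scores, key=scores.get)
```

In a real pipeline the toy vectors would be replaced by the outputs of the
fine-tuned image and text encoders; the scoring logic stays the same.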

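Using multi-modal medical images, a vision-language model (CLIP) automatically generates standardized body stations and organ lists for the whole body, improving performance by 47.6% over the baseline model (PubMedCLIP).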

Language Augmentation Techniques in CLIP for Improved Anatomy Detection in Multi-modal Medical Images

Language Augmentation in CLIP for Improved Anatomy Detection on Multi-modal Medical Images

Making each modality in multi-modal data contribute is of vital importance to
learning a versatile multi-modal model. Existing methods, however, are often
dominated by one or a few modalities during model training, resulting in
sub-optimal performance. In this paper, we refer to this problem as modality
bias and attempt to study it in the context of multi-modal classification
systematically and comprehensively. After several empirical analyses, we find
that one modality affects the model prediction more simply because it has a
spurious correlation with instance labels. To facilitate evaluation of the
modality bias problem, we construct two datasets, one for colored digit
recognition and one for video action recognition, in line with the
Out-of-Distribution (OoD) protocol. Together with the benchmarks from the
visual question answering task, we empirically demonstrate the performance
degradation of existing methods on these OoD datasets, which serves as evidence
of modality bias learning. In addition, to overcome this problem, we propose a
plug-and-play
loss function method, whereby the feature space for each label is adaptively
learned according to the training set statistics. We then apply this method to
eight baselines to test its effectiveness. Across four datasets covering the
three tasks above, our method yields remarkable performance improvements over
the baselines, demonstrating its superiority in reducing the modality bias
problem.
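
To make the idea of a spurious correlation under an OoD protocol concrete, the
following sketch assigns a color to each digit label so that color predicts
the label in the training split but not in the test split. The abstract does
not specify how the colored digit dataset was built, so the bias rate, the
color shift, and every function name here are hypothetical choices for
illustration only.

```python
import random

def assign_colors(labels, num_colors, bias, shift, rng):
    """With probability `bias`, pick the color tied to the label (offset by
    `shift`); otherwise pick a color uniformly at random."""
    colors = []
    for label in labels:
        if rng.random() < bias:
            colors.append((label + shift) % num_colors)
        else:
            colors.append(rng.randrange(num_colors))
    return colors

def agreement(labels, colors):
    """Fraction of samples whose color index equals the label, i.e. how often
    the training-time shortcut 'color == label' gives the right answer."""
    hits = sum(1 for label, color in zip(labels, colors) if color == label)
    return hits / len(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [rng.randrange(10) for _ in range(2000)]

# Training split: color matches the label for about 90% of samples (assumed rate).
train_colors = assign_colors(labels, num_colors=10, bias=0.9, shift=0, rng=rng)
# OoD test split: the color-label mapping is shifted, so the shortcut fails.
test_colors = assign_colors(labels, num_colors=10, bias=0.9, shift=5, rng=rng)

train_agreement = agreement(labels, train_colors)
test_agreement = agreement(labels, test_colors)
```

A model that leans on color alone would look accurate on the training split
and collapse on the shifted split, which is the kind of degradation the OoD
evaluation is meant to expose.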

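This paper studies the modality bias problem, which degrades model performance in multi-modal classification. By constructing two datasets based on the Out-of-Distribution protocol and proposing an adaptive plug-and-play loss function method, it achieves clear performance improvements on three tasks (colored digit recognition, video action recognition, and visual question answering), demonstrating the method's superiority in reducing modality bias.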

On Identifying and Reducing Modality Bias

On Modality Bias Recognition and Reduction

Prior work has studied different visual modalities in isolation and developed
separate architectures for recognition of images, videos, and 3D data. Instead,
in this paper, we propose a single model which excels at classifying images,
videos, and single-view 3D data using exactly the same model parameters. Our
'Omnivore' model leverages the flexibility of transformer-based architectures
and is trained jointly on classification tasks from different modalities.
Omnivore is simple to train, uses off-the-shelf standard datasets, and performs
on par with or better than modality-specific models of the same size. A single
Omnivore model obtains 86.0% on ImageNet, 84.1% on Kinetics, and 67.1% on SUN
RGB-D. After finetuning, our models outperform prior work on a variety of
vision tasks and generalize across modalities. Omnivore's shared visual
representation naturally enables cross-modal recognition without access to
correspondences between modalities. We hope our results motivate researchers to
model visual modalities together.

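This study proposes a Transformer-based 'Omnivore' model that uses the same model parameters to classify images, videos, and single-view 3D data, performs on par with or better than modality-specific models of the same size, and naturally enables cross-modal recognition.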

Omnivore: One Model for Many Visual Modalities

Omnivore: A Single Model for Many Visual Modalities

While the incipient internet was largely text-based, the modern digital world
is becoming increasingly multi-modal. Here, we examine multi-modal
classification where one modality is discrete, e.g. text, and the other is
continuous, e.g. visual representations transferred from a convolutional neural
network. In particular, we focus on scenarios where we have to be able to
classify large quantities of data quickly. We investigate various methods for
performing multi-modal fusion and analyze their trade-offs in terms of
classification accuracy and computational efficiency. Our findings indicate
that the inclusion of continuous information improves performance over
text-only on a range of multi-modal classification tasks, even with simple
fusion methods. In addition, we experiment with discretizing the continuous
features in order to speed up and simplify the fusion process even further. Our
results show that fusion with discretized features outperforms text-only
classification, at a fraction of the computational cost of full multi-modal
fusion, with the additional benefit of improved interpretability.
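
The two fusion styles mentioned above can be sketched in a few lines. The
first concatenates a text feature vector with a continuous visual vector; the
second maps the visual vector to the index of its nearest centroid and appends
that index to the text as an extra pseudo-word. The abstract does not say how
the features are discretized, so nearest-centroid assignment, the `__img_`
token prefix, and the toy centroids are assumptions for illustration rather
than the authors' procedure.

```python
def concat_fusion(text_vec, visual_vec):
    """Simple fusion: join the two feature vectors into one longer vector."""
    return list(text_vec) + list(visual_vec)

def nearest_centroid(vec, centroids):
    """Return the index of the centroid closest to `vec` (squared Euclidean)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sqdist(vec, centroids[k]))

def discretize_to_token(visual_vec, centroids, prefix="__img_"):
    """Turn a continuous visual vector into a discrete pseudo-word token."""
    return f"{prefix}{nearest_centroid(visual_vec, centroids)}"

# Toy inputs (hypothetical): a tokenized caption, a bag-of-words vector, a 2-d
# visual feature, and three centroids that would normally come from clustering.
tokens = ["red", "running", "shoes"]
text_vec = [1.0, 1.0, 1.0]
visual_vec = [0.9, 0.1]
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

fused_vector = concat_fusion(text_vec, visual_vec)
fused_tokens = tokens + [discretize_to_token(visual_vec, centroids)]
```

With the discretized form, a text-only classifier can consume the image signal
as one more token, which also makes it easy to inspect which visual token
influenced a prediction.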

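This paper studies multi-modal classification in which one modality is discrete text and the other is a continuous visual representation. Focusing on settings where large quantities of data must be classified quickly, it investigates several multi-modal fusion methods and analyzes their trade-offs between classification accuracy and computational efficiency. The results show that including continuous information improves performance on multi-modal classification tasks, even with simple fusion methods. The paper also discretizes the continuous features to further speed up and simplify fusion, which outperforms text-only classification at a fraction of the computational cost of full fusion while improving interpretability.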