Despite impressive advances, multimodal compositional reasoning approaches
remain limited in flexibility and efficiency because they process fixed modality
inputs while updating a large number of model parameters. This
paper tackles these critical challenges and proposes CREMA, an efficient and
modular modality-fusion framework for injecting any new modality into video
reasoning. We first augment multiple informative modalities (such as optical
flow, 3D point cloud, audio) from given videos without extra human annotation
by leveraging existing pre-trained models. Next, we introduce a query
transformer with multiple parameter-efficient modules associated with each
accessible modality. It projects diverse modality features to the LLM token
embedding space, allowing the model to integrate different data types for
response generation. Furthermore, we propose a fusion module designed to
compress multimodal queries, maintaining computational efficiency in the LLM
while combining additional modalities. We validate our method on video-3D,
video-audio, and video-language reasoning tasks and achieve better or equivalent
performance compared with strong multimodal LLMs, including BLIP-2, 3D-LLM, and
SeViLA, while using 96% fewer trainable parameters. We provide extensive
analyses of CREMA, including the impact of each modality on reasoning domains,
the design of the fusion module, and example visualizations.
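
The motivation for compressing multimodal queries can be illustrated with a minimal sketch. It assumes each modality yields the same number of query tokens and uses plain averaging as the compression rule; the abstract does not specify the fusion operator, so `fuse_queries`, `concat_queries`, and the toy values are hypothetical and are not the module described in the paper.

```python
# Hypothetical sketch: the names and the averaging rule are illustrative
# assumptions, not the fusion module described in the paper.
from typing import Dict, List

Matrix = List[List[float]]  # shape: (num_queries, dim)


def concat_queries(per_modality: Dict[str, Matrix]) -> Matrix:
    """Baseline: stack all query tokens, so the length grows with each modality."""
    tokens: Matrix = []
    for queries in per_modality.values():
        tokens.extend(queries)
    return tokens


def fuse_queries(per_modality: Dict[str, Matrix]) -> Matrix:
    """Compress by averaging the i-th query of every modality into one token."""
    modalities = list(per_modality.values())
    num_queries = len(modalities[0])
    dim = len(modalities[0][0])
    fused: Matrix = []
    for i in range(num_queries):
        token = [
            sum(m[i][j] for m in modalities) / len(modalities)
            for j in range(dim)
        ]
        fused.append(token)
    return fused


# Toy query tokens: 3 modalities, 2 queries each, dimension 2.
queries = {
    "video": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[3.0, 2.0], [2.0, 3.0]],
    "flow": [[2.0, 1.0], [1.0, 2.0]],
}
print(len(concat_queries(queries)))  # token count grows with modality count
print(len(fuse_queries(queries)))    # token count stays at num_queries
```

With concatenation, the number of tokens passed to the LLM scales with the number of modalities; with any fixed-size fusion, it stays constant, which is the efficiency property the abstract attributes to the fusion module.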

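This paper proposes CREMA, an efficient modality-fusion framework for injecting any new modality into video reasoning. It augments the given videos with multiple informative modalities using existing pre-trained models, then introduces a query transformer with multiple parameter-efficient modules, one associated with each accessible modality, which projects the different data types into the LLM token embedding space for response generation. It also proposes a fusion module that compresses the multimodal queries, combining additional modalities while maintaining the computational efficiency of the LLM. Validation on video-3D, video-audio, and video-language reasoning tasks shows better or equivalent performance compared with other strong multimodal LLMs (including BLIP-2, 3D-LLM, and SeViLA) while using 96% fewer trainable parameters.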

CREMA: Efficient Modular Adaptation and Fusion for Multimodal Compositional Video Reasoning

CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion

Contemporary large-scale visual language models (VLMs) exhibit strong
representation capacities, making them ubiquitous for enhancing image and text
understanding tasks. They are often trained in a contrastive manner on a large
and diverse corpus of images and corresponding text captions scraped from the
internet. Despite this, VLMs often struggle with compositional reasoning tasks
which require a fine-grained understanding of the complex interactions of
objects and their attributes. This failure can be attributed to two main
factors: 1) Contrastive approaches have traditionally focused on mining
negative examples from existing datasets. However, the mined negative examples
might not be difficult for the model to discriminate from the positive. An
alternative to mining would be negative sample generation. 2) Existing
generative approaches primarily focus on generating hard negative texts
associated with a given image. Mining in the other direction, i.e., generating
negative image samples associated with a given text, has been ignored. To
overcome both these limitations, we propose a framework that not only mines in
both directions but also generates challenging negative samples in both
modalities, i.e., images and texts. Leveraging these generative hard negative
samples, we significantly enhance VLMs' performance in tasks involving
multimodal compositional reasoning. Our code and dataset are released at
this https URL
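
Why hard negatives matter in contrastive training can be shown with a small sketch. It assumes an InfoNCE-style (CLIP-like) objective, which the abstract does not name explicitly; the function `contrastive_loss`, the temperature, and the similarity values are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: assumes an InfoNCE-style objective; all similarity
# values below are made up for illustration.
import math
from typing import List


def contrastive_loss(pos_sim: float, neg_sims: List[float],
                     temperature: float = 0.1) -> float:
    """Loss for one anchor: negative log-softmax of the positive similarity."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    return log_denominator - logits[0]


pos = 0.8                          # similarity of the matching image-text pair
easy_negatives = [0.1, 0.0, -0.2]  # mined negatives that are easy to reject
hard_negative = 0.75               # generated negative close to the positive

loss_mined = contrastive_loss(pos, easy_negatives)
loss_with_hard = contrastive_loss(pos, easy_negatives + [hard_negative])
print(loss_mined < loss_with_hard)  # the hard negative raises the loss
```

The same function can be applied in both directions: with an image as the anchor and negative texts in `neg_sims`, or with a text as the anchor and negative images in `neg_sims`. Easy negatives contribute almost nothing to the loss, whereas a negative that is nearly as similar as the positive produces a much stronger training signal.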

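By mining negative samples and generating challenging negative samples in both modalities (images and texts), this work significantly improves the performance of large-scale visual language models on multimodal compositional reasoning tasks.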

Enhancing the Multimodal Compositional Reasoning of Visual Language Models Using Generative Negative Mining

Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

With the success of Large Language Models (LLMs), a surge of Generative
Vision-Language Models (GVLMs) have been constructed via multimodal instruction
tuning. The tuning recipe substantially deviates from the common contrastive
vision-language learning. However, the performance of GVLMs in multimodal
compositional reasoning remains largely unexplored, as existing evaluation
metrics and benchmarks focus predominantly on assessing contrastive models like
CLIP. In this paper, we examine potential evaluation metrics for assessing
GVLMs and hypothesize that generative score methods are suitable for evaluating
compositionality. In addition, current benchmarks tend to prioritize syntactic
correctness over semantics. The presence of morphological bias in these
benchmarks can be exploited by GVLMs, leading to ineffective evaluations. To
combat this, we define a MorphoBias Score to quantify the morphological bias
and propose a novel LLM-based strategy to calibrate the bias. Moreover, a
challenging task is added to evaluate the robustness of GVLMs against their
inherent inclination toward syntactic correctness. We incorporate the calibrated
dataset and the task into a new benchmark, namely MOrphological De-biased Benchmark
(MODE). Our study provides the first unbiased benchmark for the
compositionality of GVLMs, facilitating future research in this direction. We
will release our code and datasets.
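
One common form of a generative score is the length-normalized log-likelihood a model assigns to a caption given an image. The sketch below assumes that form; the abstract does not define the exact metric, and `generative_score`, `prefers_positive`, and the per-token probabilities are hypothetical stand-ins for values a GVLM would produce.

```python
# Hypothetical sketch: assumes the generative score is the average token
# log-probability of a caption; the probabilities below are made up.
import math
from typing import List


def generative_score(token_probs: List[float]) -> float:
    """Length-normalized log-likelihood of a caption conditioned on an image."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def prefers_positive(pos_probs: List[float], neg_probs: List[float]) -> bool:
    """True if the model scores the correct caption above the hard negative."""
    return generative_score(pos_probs) > generative_score(neg_probs)


# Stand-in probabilities for a correct caption and a perturbed (negative) one.
positive_caption = [0.9, 0.8, 0.7]
negative_caption = [0.9, 0.8, 0.1, 0.2]

print(prefers_positive(positive_caption, negative_caption))
```

Under this kind of scoring, a negative caption that is grammatical but semantically wrong can still receive high token probabilities, which is one way the preference for syntactic correctness described above could distort an evaluation.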

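Focusing on generative vision-language models built from large language models via multimodal instruction tuning, this work examines evaluation metrics and benchmarks and provides the first unbiased benchmark for compositionality, facilitating future research in this direction.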