Instruction following is crucial in contemporary LLMs. However, when extended
to the multimodal setting, it often suffers from misalignment between a specific
textual instruction and the targeted local region of an image. To achieve more
accurate and nuanced multimodal instruction following, we introduce
Instruction-guided Visual Masking (IVM), a new versatile visual grounding model
that is compatible with diverse multimodal models, such as LMMs and robot models.
By constructing visual masks for instruction-irrelevant regions, IVM-enhanced
multimodal models can effectively focus on task-relevant image regions to
better align with complex instructions. Specifically, we design a visual
masking data generation pipeline and create an IVM-Mix-1M dataset with 1
million image-instruction pairs. We further introduce a new learning technique,
Discriminator Weighted Supervised Learning (DWSL), for preferential IVM training
that prioritizes high-quality data samples. Experimental results on generic
multimodal tasks such as VQA and embodied robotic control demonstrate the
versatility of IVM, which, as a plug-and-play tool, significantly boosts the
performance of diverse multimodal models, yielding new state-of-the-art results
across challenging multimodal benchmarks. Code is available at
this https URL
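
The masking step described above can be illustrated with a minimal sketch. It
assumes that a grounding model has already produced a per-pixel relevance map
in [0, 1] for the instruction; the function name `apply_visual_mask`, the
`background` parameter, and the blending rule are illustrative assumptions, not
the released IVM implementation.

```python
import numpy as np


def apply_visual_mask(image, relevance, background=0.0):
    """Blend instruction-irrelevant pixels toward a background value.

    image:      float array of shape (H, W, C) with values in [0, 1]
    relevance:  float array of shape (H, W) with values in [0, 1];
                1 keeps a pixel, 0 masks it out (hypothetical model output)
    background: value that masked pixels are blended toward (assumption)
    """
    image = np.asarray(image, dtype=np.float64)
    relevance = np.clip(np.asarray(relevance, dtype=np.float64), 0.0, 1.0)
    if image.shape[:2] != relevance.shape:
        raise ValueError("relevance map must match the image height and width")
    weight = relevance[..., None]  # broadcast over the channel axis
    return weight * image + (1.0 - weight) * background


# Toy example: keep only the central 2x2 region of a uniform 4x4 image.
image = np.full((4, 4, 3), 0.8)
relevance = np.zeros((4, 4))
relevance[1:3, 1:3] = 1.0
masked = apply_visual_mask(image, relevance)
```

The masked image can then be passed to a downstream multimodal model in place
of the original, which is what makes the approach plug-and-play.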

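By introducing Instruction-guided Visual Masking (IVM) to improve multimodal
instruction following, this work demonstrates the applicability of IVM in the
multimodal setting and shows its advantage in accurately aligning images with
instructions. By constructing visual masks, IVM-enhanced multimodal models
focus better on task-relevant image regions and therefore follow instructions
better. Experiments show that IVM, as a plug-and-play tool, significantly
improves the performance of diverse multimodal models and achieves new
state-of-the-art results on a range of challenging multimodal benchmarks.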

Instruction-Guided Visual Masking

While different neural models often exhibit latent spaces that are alike when
exposed to semantically related data, this intrinsic similarity is not always
immediately discernible. Towards a better understanding of this phenomenon, our
work shows how representations learned from these neural modules can be
translated between different pre-trained networks via simpler transformations
than previously thought. An advantage of this approach is the ability to
estimate these transformations using standard, well-understood algebraic
procedures that have closed-form solutions. Our method directly estimates a
transformation between two given latent spaces, thereby enabling effective
stitching of encoders and decoders without additional training. We extensively
validate the adaptability of this translation procedure in different
experimental settings: across training runs, domains, and architectures (e.g.,
ResNet, CNN, ViT), and in multiple downstream tasks (classification,
reconstruction). Notably, we show how it is possible to zero-shot stitch text
encoders and vision decoders, or vice versa, yielding surprisingly good
classification performance in this multimodal setting.
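
As an illustration of estimating a transformation between two latent spaces in
closed form, the sketch below fits an orthogonal map (orthogonal Procrustes,
via SVD) and an unconstrained linear map (ordinary least squares) on paired
samples. The choice of these two estimators, the function names, and the
assumption that both spaces have the same dimensionality are ours; the abstract
only states that standard algebraic procedures with closed-form solutions are
used.

```python
import numpy as np


def fit_orthogonal_map(source, target):
    """Orthogonal Procrustes: argmin_R ||source @ R - target||_F with R^T R = I.

    source, target: arrays of shape (n_samples, dim) holding paired latents.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def fit_linear_map(source, target):
    """Unconstrained least-squares map: argmin_W ||source @ W - target||_F."""
    return np.linalg.lstsq(source, target, rcond=None)[0]


# Toy example: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
target = source @ true_rotation

estimated = fit_orthogonal_map(source, target)
least_squares = fit_linear_map(source, target)
```

Once a map is fitted on paired samples, latents from one encoder can be
multiplied by it and fed to a decoder trained on the other space, with no
further training.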

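This work shows that representations learned by neural network models can be
translated between different pre-trained networks using simple transformations,
which makes it possible to stitch encoders and decoders effectively and yields
strong classification performance in the multimodal setting.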

Latent Space Translation via Semantic Alignment

Recent advancements in diffusion models have enabled the generation of
realistic deepfakes by writing textual prompts in natural language. While these
models have numerous benefits across various sectors, they have also raised
concerns about the potential misuse of fake images and put new pressure on
fake image detection. In this work, we pioneer a systematic study of the
authenticity of fake images generated by state-of-the-art diffusion models.
Firstly, we conduct a comprehensive study on the performance of contrastive and
classification-based visual features. Our analysis demonstrates that fake
images share common low-level cues, which render them easily recognizable.
Further, we devise a multimodal setting wherein fake images are synthesized by
different textual captions, which are used as seeds for a generator. Under this
setting, we quantify the performance of fake detection strategies and introduce
a contrastive-based disentangling strategy that lets us analyze the role of the
semantics of textual descriptions and low-level perceptual cues. Finally, we
release a new dataset, called COCOFake, containing about 600k images generated
from original COCO images.

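This paper systematically studies the authenticity of fake images generated by
state-of-the-art diffusion models, analyzes the role of low-level image cues
and of the semantics of the textual captions used as seeds, and releases
COCOFake, a new dataset of about 600k images.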

Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

In this paper, we introduce Key-Value Memory Networks to a multimodal setting,
along with a novel key-addressing mechanism for sequence-to-sequence models.
The proposed model naturally decomposes the problem of video captioning into
vision and language segments, dealing with them as key-value pairs. More
specifically, we learn a semantic embedding (v) corresponding to each frame (k)
in the video, thereby creating (k, v) memory slots. We propose to find the next
step attention weights conditioned on the previous attention distributions for
the key-value memory slots in the memory addressing schema. Exploiting this
flexibility of the framework, we additionally capture spatial dependencies
while mapping from the visual to semantic embedding. Experiments on the
Youtube2Text dataset demonstrate the usefulness of recurrent key-addressing and
achieve competitive scores on the BLEU@4 and METEOR metrics against
state-of-the-art models.
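
The addressing scheme described above can be sketched as a single attention
step over (k, v) memory slots whose scores depend on the previous attention
distribution. The additive form of that dependence, the `history_weight`
parameter, and all function names are assumptions made for illustration; the
abstract does not specify how the conditioning is implemented.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def address_memory(query, keys, values, prev_attention, history_weight=1.0):
    """One key-addressing step over (k, v) memory slots.

    query:          decoder state, shape (d,)
    keys:           per-frame keys k, shape (T, d)
    values:         per-frame semantic embeddings v, shape (T, m)
    prev_attention: attention weights from the previous step, shape (T,)
    history_weight: strength of the dependence on prev_attention (assumption)
    """
    # Assumed conditioning: add the previous attention as a bias on the scores.
    scores = keys @ query + history_weight * prev_attention
    attention = softmax(scores)
    context = attention @ values  # weighted read of the value slots
    return attention, context


# Toy example: 5 frames, 4-dimensional keys, 3-dimensional values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 3))
attention = np.full(5, 1.0 / 5)  # start from a uniform distribution
for _ in range(2):  # two decoding steps with random stand-in queries
    query = rng.normal(size=4)
    attention, context = address_memory(query, keys, values, attention)
```

Feeding each step's attention weights back into the next step is what makes
the addressing recurrent in this sketch.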

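This paper applies Key-Value Memory Networks to a multimodal setting and
proposes a new key-addressing mechanism. It decomposes video captioning
naturally into a vision side and a language side, handled as key-value pairs,
and proposes a recurrent attention method within the addressing schema to
capture contextual information. Experiments show that the method improves
BLEU@4 and METEOR scores and is competitive with state-of-the-art approaches.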