Large Language Models have demonstrated remarkable performance across various
tasks, exhibiting the capacity to swiftly acquire new skills, such as through
In-Context Learning (ICL) with minimal demonstration examples. In this work, we
present a comprehensive framework for investigating Multimodal ICL (M-ICL) in
the context of Large Multimodal Models. We consider the best open-source
multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal
tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily
relies on text-driven mechanisms, showing little to no influence from the image
modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is
no better than a simple strategy based on majority voting over context examples.
Moreover, we identify several biases and limitations of M-ICL that warrant
consideration prior to deployment. Code is available at
gitlab.com/folbaeni/multimodal-icl
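
The majority-voting baseline mentioned in finding (2) can be illustrated with a minimal sketch. It assumes that a RICES-style strategy selects the context examples most similar to the query, and that the baseline then returns the most frequent label among them without calling the model. The function names, the toy embeddings, and the labels are hypothetical and are not taken from the paper's code.

```python
from collections import Counter

import numpy as np


def retrieve_top_k(query_vec, pool_vecs, k):
    """Return indices of the k pool items most similar to the query (cosine)."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q
    return np.argsort(-sims)[:k]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one context example")
    return Counter(labels).most_common(1)[0][0]


# Toy feature vectors standing in for image embeddings (assumption).
pool_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
pool_labels = ["cat", "cat", "dog", "dog"]
query_vec = np.array([1.0, 0.05])

idx = retrieve_top_k(query_vec, pool_vecs, k=3)
prediction = majority_vote([pool_labels[i] for i in idx])
print(prediction)
```

Because the prediction depends only on the retrieved labels, such a baseline is a useful reference point for how much of the M-ICL gain comes from retrieval rather than from in-context reasoning.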

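By studying multimodal ICL in large multimodal models, we find that M-ICL relies primarily on text-driven mechanisms and is almost unaffected by the image modality. When used with advanced ICL strategies (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. We also identify several biases and limitations of M-ICL that deserve consideration before deployment.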

What Is the Key to Multimodal In-Context Learning?

What Makes Multimodal In-Context Learning Work?

To build self-consistent personalized dialogue agents, previous
research has mostly focused on textual personas that convey personal facts or
personalities. However, persona is multi-faceted, and the
image modality can better reveal the speaker's personal characteristics
and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this
work, we extend persona-based dialogue to the multimodal domain and make two
main contributions. First, we present the first multimodal persona-based
dialogue dataset named MPCHAT, which extends persona with both text and images
to contain episodic memories. Second, we empirically show that incorporating
multimodal persona, as measured by three proposed multimodal persona-grounded
dialogue tasks (i.e., next response prediction, grounding persona prediction,
and speaker identification), leads to statistically significant performance
improvements across all tasks. Thus, our work highlights that multimodal
persona is crucial for improving multimodal dialogue comprehension, and our
MPCHAT serves as a high-quality resource for this research.

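This study introduces the image modality to convey the personality traits and experiences of a multi-faceted persona, explores the application and role of multimodal personas in dialogue, and shows through experiments on several tasks that incorporating multimodal personas significantly improves multimodal dialogue performance.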

MPCHAT: Toward Multimodal Persona-Driven Dialogue

MPCHAT: Towards Multimodal Persona-Grounded Conversation

Transformers, models built on an attention-based encoder-decoder architecture,
have gained prevalence in the field of natural language processing (NLP) and
have recently influenced the computer vision (CV) space. The similarities
between computer vision and medical imaging have raised a question among
researchers: can the impact of transformers on computer vision be translated to
medical imaging? In this paper, we attempt to provide a comprehensive and
recent review of the application of transformers in medical imaging by:
describing the transformer model and comparing it with a diversity of
convolutional neural networks (CNNs); detailing the transformer-based
approaches for medical image classification, segmentation, registration, and
reconstruction, with a focus on the image modality; and comparing the
performance of state-of-the-art transformer architectures to the
best-performing CNNs on standard medical datasets.
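
The attention mechanism at the core of the transformer can be sketched in a few lines. The example below is a minimal, single-head scaled dot-product self-attention in NumPy, applied to random vectors that stand in for image-patch tokens. The token count, the dimension, and the absence of learned projection matrices are simplifying assumptions made for illustration.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))  # 4 hypothetical patch tokens of dimension 8

# Self-attention: queries, keys, and values all come from the same tokens.
out, w = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, w.sum(axis=1))
```

Each output token is a weighted average over all tokens, which is what lets a transformer relate distant image regions in a single layer, in contrast to the local receptive field of a convolution.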

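This paper reviews the application of Transformer models in medical image processing, including a comparison between the Transformer, with its attention-based encoder-decoder structure, and convolutional neural networks; Transformer-based methods for medical image classification, segmentation, registration, and reconstruction; and a performance comparison with CNNs on standard medical datasets.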

Vision Transformers in Medical Images: A Survey

Vision Transformers in Medical Imaging: A Review

In real-world scenarios, many data processing problems involve
heterogeneous images associated with different imaging modalities. Since these
multimodal images originate from the same phenomenon, it is realistic to assume
that they share common attributes or characteristics. In this paper, we propose
a multi-modal image processing framework based on coupled dictionary learning
to capture similarities and disparities between different image modalities. In
particular, our framework can capture favorable structure similarities across
different image modalities such as edges, corners, and other elementary
primitives in a learned sparse transform domain, instead of the original pixel
domain, that can be used to improve a number of image processing tasks such as
denoising, inpainting, or super-resolution. Practical experiments demonstrate
that incorporating multimodal information using our framework brings notable
benefits.
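
The idea of sharing structure across modalities in a sparse domain can be illustrated with a deliberately simplified sketch. It assumes two dictionaries, `D_x` and `D_y`, whose atoms are paired so that a patch from each modality is described by the same sparse code. The code is estimated from one modality with orthogonal matching pursuit and then used to synthesize the other. This covers only the shared (similarity) component, not the disparities the framework also models, and the dictionaries here are random orthonormal matrices rather than learned ones. All names are hypothetical.

```python
import numpy as np


def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit with at most n_nonzero atoms of D."""
    residual = x.copy()
    support = []
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    coef[support] = sol
    return coef


rng = np.random.default_rng(0)
# Stand-in coupled dictionaries with orthonormal atoms (not learned).
D_x, _ = np.linalg.qr(rng.normal(size=(16, 8)))
D_y, _ = np.linalg.qr(rng.normal(size=(16, 8)))

# A shared sparse code generates a patch in each modality.
z_true = np.zeros(8)
z_true[[1, 5]] = [2.0, -1.5]
x = D_x @ z_true
y = D_y @ z_true

# Estimate the code from modality x, then predict modality y.
z_hat = omp(D_x, x, n_nonzero=2)
y_hat = D_y @ z_hat
print(np.allclose(y_hat, y))
```

In a real pipeline the dictionaries would be learned jointly from registered patch pairs, and the shared code would let a clean guidance modality support denoising, inpainting, or super-resolution of the other.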

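This paper proposes a multimodal image processing framework based on coupled dictionary learning that captures similarities and disparities between different image modalities in a learned sparse transform domain and can be used to improve image processing tasks such as image denoising.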