Multi-modal transformers have driven significant progress across domains, but
siloed high-quality data hinders their further improvement. To remedy this,
federated learning (FL) has emerged as a promising privacy-preserving paradigm
for training models without direct access to the raw data held by different
clients. Despite its potential, the combination of unpaired uni-modal clients
and the transformer architecture in FL remains unexplored. To fill this gap,
this paper explores a transfer multi-modal
federated learning (MFL) scenario within the vision-language domain, where
clients possess data of various modalities distributed across different
datasets. We systematically evaluate the performance of existing methods when a
transformer architecture is used and introduce a novel framework, Federated
modality complementary and collaboration (FedCola), which addresses the
in-modality and cross-modality gaps among clients. Through extensive
experiments across various FL settings, FedCola demonstrates superior
performance over previous approaches, offering new perspectives on future
federated training of multi-modal transformers.
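
As a point of reference for the FL setting described above, the sketch below
shows one server-side aggregation step in the style of federated averaging
(FedAvg), where only model parameters, never raw data, leave the clients. It is
not FedCola, whose details the abstract does not give; the function name
`fedavg`, the dictionary layout of the parameters, and the toy numbers are
assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Params = {}
    for name, reference in client_params[0].items():
        averaged[name] = [
            sum(params[name][i] * size
                for params, size in zip(client_params, client_sizes)) / total
            for i in range(len(reference))
        ]
    return averaged


# Toy round: two clients holding 1 and 3 local samples respectively.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
print(fedavg(clients, [1, 3]))  # {'w': [2.5, 5.0]}
```

In a setting with uni-modal clients, one would expect the server to aggregate
only the parameters that clients of the same modality share, but how that is
done here is not stated in the abstract.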

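In the vision-language domain, this work systematically evaluates existing
methods with a transformer architecture and introduces a new framework called
FedCola, filling the research gap on unpaired uni-modal clients and the
transformer architecture in FL. Through extensive experiments across various FL
settings, FedCola outperforms previous methods, offering new perspectives on
future federated training of multi-modal transformers.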

Towards Multi-modal Transformers in Federated Learning

Mechanistic interpretability seeks to understand the neural mechanisms that
enable specific behaviors in Large Language Models (LLMs) by leveraging
causality-based methods. While these approaches have identified neural circuits
that copy spans of text, capture factual knowledge, and more, they remain
unusable for multimodal models since adapting these tools to the
vision-language domain requires considerable architectural changes. In this
work, we adapt a unimodal causal tracing tool to BLIP to enable the study of
the neural mechanisms underlying image-conditioned text generation. We
demonstrate our approach on a visual question answering dataset, highlighting
the causal relevance of later layer representations for all tokens.
Furthermore, we release our BLIP causal tracing tool as open source to enable
further experimentation in vision-language mechanistic interpretability by the
community. Our code is available at
this https URL
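
The general idea behind causal tracing can be sketched on a toy model: run the
model on a clean input, run it on a corrupted input, then restore one clean
hidden state at a time in the corrupted run and measure how much of the output
is recovered. The code below is only an illustration of that idea under strong
simplifying assumptions (scalar per-token states, deterministic corruption, a
hand-written two-layer "model"); it is not the released BLIP tool, and the
names `run`, `trace`, and `readout` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

State = List[float]                   # one scalar "hidden state" per token
Layer = Callable[[State], State]
Patch = Tuple[int, int, float]        # (layer index, token position, value)


def run(layers: List[Layer], x: State,
        patch: Optional[Patch] = None) -> List[State]:
    """Run all layers, optionally overwriting one state after one layer."""
    h = list(x)
    states: List[State] = []
    for idx, layer in enumerate(layers):
        h = list(layer(h))
        if patch is not None and patch[0] == idx:
            _, pos, value = patch
            h[pos] = value            # restore the clean state here
        states.append(list(h))
    return states


def trace(layers: List[Layer], readout: Callable[[State], float],
          clean: State, corrupted: State) -> List[List[float]]:
    """Return effects[layer][token]: output recovered by restoring one state."""
    clean_states = run(layers, clean)
    baseline = readout(run(layers, corrupted)[-1])
    effects: List[List[float]] = []
    for layer_idx in range(len(layers)):
        row = []
        for pos in range(len(clean)):
            patch = (layer_idx, pos, clean_states[layer_idx][pos])
            patched = run(layers, corrupted, patch)
            row.append(readout(patched[-1]) - baseline)
        effects.append(row)
    return effects


# Toy model: layer 0 doubles every token, layer 1 mixes token 1 into token 0.
toy_layers: List[Layer] = [
    lambda h: [2.0 * v for v in h],
    lambda h: [h[0] + h[1], h[1]],
]
effects = trace(toy_layers, readout=lambda h: h[0],
                clean=[1.0, 1.0], corrupted=[0.0, 1.0])
print(effects)  # [[2.0, 0.0], [2.0, 0.0]]
```

Here only token 0 was corrupted, so restoring its state at either layer
recovers the full output gap, while restoring token 1 has no effect. In a real
model the readout would typically be the probability of the correct answer and
the corruption would be noise added to input embeddings.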

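By adapting a unimodal causal tracing tool to BLIP, we study the neural
mechanisms underlying image-conditioned text generation and demonstrate our
approach on a visual question answering dataset, highlighting the causal
relevance of later layer representations for all tokens. In addition, we
release our BLIP causal tracing tool as open source so that the community can
further explore vision-language mechanistic interpretability.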

Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Large-scale language models have shown the ability to adapt to a new task via
conditioning on a few demonstrations (i.e., in-context learning). However, in
the vision-language domain, most large-scale pre-trained vision-language (VL)
models do not possess the ability to conduct in-context learning. How can we
enable in-context learning for VL models? In this paper, we study an
interesting hypothesis: can we transfer the in-context learning ability from
the language domain to the VL domain? Specifically, we first meta-train a language
model to perform in-context learning on NLP tasks (as in MetaICL); then we
transfer this model to perform VL tasks by attaching a visual encoder. Our
experiments suggest that in-context learning ability can indeed be transferred
across modalities: our model considerably improves in-context learning
capability on VL tasks and can even compensate significantly for model size.
On VQA, OK-VQA, and GQA, our method can outperform the baseline model while
having 20 times fewer parameters.
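
To make "conditioning on a few demonstrations" concrete, the sketch below
assembles a text-only few-shot prompt from question-answer pairs and a final
unanswered query. The prompt template, the prefixes, and the function name
`build_icl_prompt` are assumptions for illustration; the abstract does not
specify the format used, and in the VL setting each demonstration would
additionally carry image features produced by the visual encoder.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]], query: str,
                     q_prefix: str = "Question: ",
                     a_prefix: str = "Answer: ") -> str:
    """Concatenate demonstrations and leave the final answer slot empty."""
    blocks = [f"{q_prefix}{q}\n{a_prefix}{a}" for q, a in demos]
    # The query reuses the same template, with nothing after the answer prefix.
    blocks.append(f"{q_prefix}{query}\n{a_prefix}".rstrip())
    return "\n\n".join(blocks)


demos = [("What color is the sky?", "blue"),
         ("How many wheels does a bicycle have?", "two")]
print(build_icl_prompt(demos, "What do bees make?"))
```

The model is then asked to continue the prompt, so the demonstrations act as
the only task specification and no parameters are updated at inference time.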

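This paper studies how to give large-scale pre-trained vision-language models
the ability to learn in context, by applying meta-learning from the natural
language processing domain to the vision-language domain and using a visual
encoder for cross-domain transfer. Experiments show that this considerably
improves in-context learning on visual question answering tasks, can even
compensate for model size, and outperforms the baseline model.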