Multi-modal transformers mark significant progress in different domains, but
siloed high-quality data hinders their further improvement. To remedy this,
federated learning (FL) has emerged as a promising privacy-preserving paradigm
for training models without direct access to the raw data held by different
clients. Despite its potential, research on unpaired uni-modal clients and on
the transformer architecture in FL remains unexplored. To fill this gap, this
paper explores a transfer multi-modal
federated learning (MFL) scenario within the vision-language domain, where
clients possess data of various modalities distributed across different
datasets. We systematically evaluate the performance of existing methods when a
transformer architecture is used and introduce a novel framework, Federated
modality complementary and collaboration (FedCola), which addresses the
in-modality and cross-modality gaps among clients. Through extensive
experiments across various FL settings, FedCola demonstrates superior
performance over previous approaches, offering new perspectives on future
federated training of multi-modal transformers.
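
As a point of reference for the FL setting described above, the following is a minimal sketch of the standard FedAvg-style server aggregation step, in which only model parameters, never raw data, leave the clients. It is not FedCola itself; the abstract does not specify that algorithm. The names `fed_avg` and `Params` and the toy client values are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: each parameter tensor is flattened to a list of floats.
Params = Dict[str, List[float]]


def fed_avg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its local sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients sharing one parameter vector "w".
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
sizes = [1, 3]  # the second client holds three times as much data
global_params = fed_avg(clients, sizes)
```

In this toy run the second client dominates the average because it holds more samples. With unpaired uni-modal clients, image-only and text-only clients would not necessarily share every parameter, so a real system would also need a rule for which parameters are aggregated across which clients.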

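Within the vision-language domain, this work systematically evaluates existing methods under a transformer architecture and introduces a new framework called FedCola, filling the research gap concerning unpaired uni-modal clients and the transformer architecture in FL. Through extensive experiments across various FL settings, FedCola demonstrates superior performance over previous methods, offering new perspectives on future federated training of multi-modal transformers.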

Towards Multi-modal Transformers in Federated Learning

Video summarization has become an increasingly important task in the field of
computer vision due to the vast amount of video content available on the
internet. In this project, we propose a new method for joint video
summarization and highlight detection based on natural language queries, using
multi-modal transformers. The approach uses both visual and audio cues to match
a user's natural language query and retrieve the most relevant and interesting
moments from a video. It employs multiple recent techniques from Vision
Transformers (ViTs) to build a transformer-like encoder-decoder model. We
evaluate the approach on multiple datasets, such as YouTube Highlights and
TVSum, to demonstrate the flexibility of the proposed method.

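This project proposes a new method that uses multi-modal transformers for video summarization and highlight detection based on natural language queries, matching a user's natural language query to retrieve the most relevant and interesting moments in a video, and evaluates it on multiple datasets such as YouTube Highlights and TVSum.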

Joint Moment Retrieval and Highlight Detection via Natural Language Queries

We propose Pixel-BERT to align image pixels with text by deep multi-modal
transformers that jointly learn visual and language embedding in a unified
end-to-end framework. We aim to build a more accurate and thorough connection
between image pixels and language semantics directly from image and sentence
pairs, instead of using region-based image features as most recent vision
and language methods do. Pixel-BERT, which aligns semantics at the pixel
and text level, addresses the limitation of task-specific visual representations
for vision and language tasks. It also relieves the cost of bounding box
annotations and overcomes the imbalance between semantic labels in visual tasks
and language semantics. To provide a better representation for downstream
tasks, we pre-train a universal end-to-end model with image and sentence pairs
from the Visual Genome and MS-COCO datasets. We propose to use a random
pixel sampling mechanism to enhance the robustness of visual representation and
to apply the Masked Language Model and Image-Text Matching as pre-training
tasks. Extensive experiments with our pre-trained model show that our approach
achieves state-of-the-art results on downstream tasks, including Visual
Question Answering (VQA), image-text retrieval, and Natural Language for
Visual Reasoning for Real (NLVR2). In particular, we boost the performance of a
single model on the VQA task by 2.17 points compared with the previous state of
the art (SOTA) under a fair comparison.
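
To make the random pixel sampling idea above concrete, here is a minimal sketch under stated assumptions: the visual backbone is taken to produce an H x W grid of feature vectors, and a random subset of grid positions is kept before the features are passed on to the transformer. The function name `sample_pixel_features`, the `num_samples` parameter, the toy grid, and the choice to keep the sampled positions in their original order are all hypothetical; the abstract does not give these details.

```python
import random
from typing import List

FeatureVec = List[float]


def sample_pixel_features(
    feature_map: List[List[FeatureVec]],
    num_samples: int,
    rng: random.Random,
) -> List[FeatureVec]:
    """Flatten an H x W grid of feature vectors and keep a random subset of positions."""
    flat = [vec for row in feature_map for vec in row]
    if num_samples >= len(flat):
        # Nothing to drop: return every position.
        return list(flat)
    # Sample distinct positions, then sort so the kept features stay in grid order
    # (an illustrative choice, not something the source specifies).
    indices = sorted(rng.sample(range(len(flat)), num_samples))
    return [flat[i] for i in indices]


# Toy 4 x 4 grid whose 2-dimensional "features" simply encode (row, column).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
sampled = sample_pixel_features(grid, num_samples=5, rng=random.Random(0))
```

Because each training step sees a different subset of positions, the model cannot rely on any single pixel feature, which is one plausible reading of the robustness benefit the abstract mentions. Sampling would typically be applied only during pre-training, with the full grid used otherwise.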

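Pixel-BERT is a deep multi-modal transformer that jointly learns from image and text data to build semantic connections at the pixel and text level, giving a more accurate and thorough connection for vision and language tasks and addressing the imbalance between semantic labels in visual tasks and language semantics.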