Humans possess the capability to comprehend diverse modalities and seamlessly
transfer information between them. In this work, we introduce ModaVerse, a
Multi-modal Large Language Model (MLLM) capable of comprehending and
transforming content across various modalities including images, videos, and
audio. Predominant MLLM frameworks have largely relied on the alignment of
latent spaces of textual and non-textual features. This alignment process,
which synchronizes a language model trained on textual data with encoders and
decoders trained on multi-modal data, often necessitates extensive training of
several projection layers in multiple stages. Inspired by LLM-as-agent
methodologies, we propose a novel Input/Output (I/O) alignment mechanism that
operates directly at the level of natural language. It aligns the LLM's output
with the input of generative models, avoiding the complexities associated with
latent feature alignments, and simplifying the multiple training stages of
existing MLLMs into a single, efficient process. This conceptual advancement
leads to significant reductions in both data and computational costs. By
conducting experiments on several benchmarks, we demonstrate that our approach
attains performance comparable to the state of the art while achieving
considerable efficiencies in data usage and training duration.
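
The I/O alignment idea above can be illustrated with a minimal sketch: the LLM emits a plain-text instruction, and a thin dispatcher forwards it to a generative model for the requested modality. The `<image>`/`<video>`/`<audio>` tag format, the function names, and the stand-in generators below are all hypothetical and chosen only for illustration; the abstract does not specify the instruction format ModaVerse actually uses.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical instruction format (an assumption, not from the paper):
# "<image> a red bicycle by a lake"
INSTRUCTION = re.compile(r"^<(image|video|audio)>\s*(.+)$", re.DOTALL)


def parse_instruction(llm_output: str) -> Optional[Tuple[str, str]]:
    """Return (modality, prompt) if the output is a generation instruction."""
    match = INSTRUCTION.match(llm_output.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def dispatch(llm_output: str, generators: Dict[str, Callable[[str], str]]) -> str:
    """Route a language-level instruction to the matching generative model.

    Outputs without a recognized tag are treated as an ordinary text answer.
    """
    parsed = parse_instruction(llm_output)
    if parsed is None:
        return llm_output
    modality, prompt = parsed
    return generators[modality](prompt)


# Stand-in generators; a real system would call text-to-image, text-to-video,
# and text-to-audio models here.
generators = {
    "image": lambda prompt: f"[image generated from: {prompt}]",
    "video": lambda prompt: f"[video generated from: {prompt}]",
    "audio": lambda prompt: f"[audio generated from: {prompt}]",
}

print(dispatch("<image> a red bicycle by a lake", generators))
print(dispatch("The clip shows a dog barking.", generators))
```

Because the interface between the LLM and the generators is ordinary text in this sketch, no projection layers between latent spaces are involved; only the LLM has to learn to produce well-formed instructions.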

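We introduce ModaVerse, a multi-modal large language model (MLLM) that can comprehend and transform content across modalities such as images, videos, and audio. By performing input/output alignment at the level of natural language, it avoids the complexity of latent feature alignment and simplifies the multiple training stages of existing MLLMs, substantially reducing data and computational costs. In experiments on several benchmarks, the method achieves performance comparable to the state of the art while delivering notable efficiency gains in data usage and training time.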

ModaVerse: Efficiently Transforming Modalities with LLMs

Image-text retrieval, whose core concept is to measure the similarity between
images and text, is a widely studied topic in computer vision due to the
exponential growth of multimedia data. However, most existing
retrieval methods heavily rely on cross-attention mechanisms for cross-modal
fine-grained alignment, which takes into account excessive irrelevant regions
and treats prominent and non-significant words equally, thereby limiting
retrieval accuracy. This paper aims to investigate an alignment approach that
reduces the involvement of non-significant fragments in images and text while
enhancing the alignment of prominent segments. For this purpose, we introduce
the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which
achieves improved retrieval accuracy by diminishing the participation of
irrelevant regions during alignment and relatively increasing the alignment
similarity of prominent words. Additionally, we incorporate prior textual
information into image regions to reduce misalignment occurrences. In practice,
we first design a novel intra-modal fragments relationship reasoning method,
and subsequently employ our proposed alignment mechanism to compute the
similarity between images and text. Extensive quantitative comparative
experiments on MS-COCO and Flickr30K datasets demonstrate that our approach
outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.
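
For readers unfamiliar with the metric, the sketch below computes rSum as it is commonly defined in the image-text retrieval literature: the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions. The abstract does not spell out the definition, so treat this as an assumption. The sketch also assumes a square similarity matrix in which image `i` matches exactly text `i`; MS-COCO and Flickr30K pair each image with several captions, which a full evaluation would need to handle.

```python
from typing import List, Sequence


def recall_at_k(sim: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j, and the
    true match of query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim: Sequence[Sequence[float]], ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of Recall@k over both directions (image-to-text and text-to-image)."""
    transposed: List[List[float]] = [list(col) for col in zip(*sim)]
    return sum(recall_at_k(sim, k) + recall_at_k(transposed, k) for k in ks)


# Toy similarity matrix with two images (rows) and two texts (columns).
toy_sim = [
    [0.1, 0.9],
    [0.2, 0.8],
]
print(rsum(toy_sim))
```

Under this definition rSum ranges from 0 to 600, with 600 meaning that every query retrieves its true match at rank 1 in both directions.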

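To reduce the participation of non-significant image and text fragments and raise the alignment similarity of prominent ones, this paper introduces the Cross-Modal Prominent Fragments Enhancement Aligning Network (CPFEAN), which improves retrieval accuracy by diminishing the involvement of irrelevant regions during alignment and relatively increasing the alignment similarity of prominent words. Extensive quantitative comparative experiments on the MS-COCO and Flickr30K datasets show that the method outperforms state-of-the-art methods by about 5% to 10% in the rSum metric.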

Cross-modal Prominent Fragments Enhancement Aligning Network for Image-text Retrieval

Attention-based sequence-to-sequence models for speech recognition jointly
train an acoustic model, language model (LM), and alignment mechanism using a
single neural network and require only parallel audio-text pairs. Thus, the
language model component of the end-to-end model is only trained on transcribed
audio-text pairs, which leads to performance degradation especially on rare
words. While a variety of work has looked at incorporating an external LM
trained on text-only data into the end-to-end framework, none of it has taken
into account the characteristic distribution of errors made by the model. In
this paper, we propose a novel approach to utilizing text-only data,
by training a spelling correction (SC) model to explicitly correct those
errors. On the LibriSpeech dataset, we demonstrate that the proposed model
results in an 18.6% relative improvement in WER over the baseline model when
directly correcting the top ASR hypothesis, and a 29.0% relative improvement
when further rescoring an expanded n-best list using an external LM.
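
To make the reported quantities concrete, the following sketch computes word error rate (WER) as word-level edit distance divided by reference length, and the relative improvement between two WER values. The function names and the example sentences and numbers are illustrative assumptions; they are not taken from the paper or from LibriSpeech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only: one substitution and one deletion in six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# Illustrative values only: a drop from 10.0% to 8.0% WER.
print(relative_improvement(10.0, 8.0))
```

A relative improvement is measured against the baseline WER rather than in absolute percentage points, so the same absolute gain counts for more when the baseline WER is already low.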

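This work proposes a new approach that trains a spelling correction model to correct the errors made by attention-based sequence-to-sequence speech recognition models, thereby improving performance. On the LibriSpeech dataset, the model achieves an 18.6% relative improvement in WER over the baseline when directly correcting the top ASR hypothesis, and a 29.0% relative improvement when further rescoring an expanded n-best list using an external LM.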