We introduce X-VILA, an omni-modality model designed to extend the
capabilities of large language models (LLMs) by incorporating image, video, and
audio modalities. By aligning modality-specific encoders with LLM inputs and
diffusion decoders with LLM outputs, X-VILA achieves cross-modality
understanding, reasoning, and generation. To facilitate this cross-modality
alignment, we curate an effective interleaved any-to-any modality
instruction-following dataset. Furthermore, we identify a significant problem
with the current cross-modality alignment method, which results in visual
information loss. To address the issue, we propose a visual alignment mechanism
with a visual embedding highway module. We then introduce a resource-efficient
recipe for training X-VILA, which exhibits proficiency in any-to-any modality
conversation, surpassing previous approaches by large margins. X-VILA also
showcases emergent properties across modalities even in the absence of similar
training data. The project will be made open-source.

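X-VILA is an omni-modality model that extends the capabilities of large
language models (LLMs) by incorporating image, video, and audio modalities,
enabling cross-modality understanding, reasoning, and generation. Building on
this, an effective interleaved any-to-any modality instruction-following
dataset and a visual embedding highway module address the visual information
loss in current cross-modality alignment methods, so that X-VILA outperforms
previous approaches in any-to-any modality conversation.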