We present VisionLLM v2, an end-to-end generalist multimodal large language
model (MLLM) that unifies visual perception, understanding, and generation within a
single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2
significantly broadens the application scope of MLLMs. It excels not only in
conventional visual question answering (VQA) but also in open-ended,
cross-domain vision tasks such as object localization, pose estimation, and
image generation and editing. To this end, we propose a new information
transmission mechanism, termed "super link", that connects the MLLM with
task-specific decoders. It not only allows flexible transmission of task
information and gradient feedback between the MLLM and multiple downstream
decoders but also effectively resolves training conflicts in multi-task
scenarios. In addition, to support this diverse range of tasks, we carefully
collected and curated training data from hundreds of public vision and
vision-language tasks. In this way, our model can be jointly trained end-to-end
on hundreds of vision-language tasks and generalize to them using a single set
of shared parameters, steered by different user prompts, achieving performance
comparable to task-specific models. We believe VisionLLM v2 will offer a new
perspective on the generalization of MLLMs.

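VisionLLM v2 is an end-to-end multimodal large model that unifies visual perception, understanding, and generation in a single framework. An information transmission mechanism called "super link" connects the model to task-specific decoders, enabling flexible transmission of task information and gradient feedback and resolving training conflicts in multi-task scenarios. Through different user prompts, the model supports end-to-end joint training on, and generalization to, a wide range of vision-language tasks, reaching performance comparable to task-specific models.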

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Despite the success of Large Language Models (LLMs) in general image tasks, the
medical field still lacks a multimodal large model adept at handling the
nuanced diversity of medical images. To address this gap, we propose
MedXChat, a unified multimodal large model designed for seamless interactions
between medical assistants and users. MedXChat encompasses three key
functionalities: Chest X-ray (CXR)-to-Report generation, CXR-based visual
question answering (VQA), and Text-to-CXR synthesis. Our contributions are as
follows. First, our model shows strong cross-task adaptability, performing
well on all three tasks and outperforming the benchmark models on the MIMIC
dataset in medical multimodal applications. Second, we introduce a
Text-to-CXR synthesis approach that uses the instruction-following
capabilities within the Stable Diffusion (SD) architecture. This technique
integrates smoothly with the existing model framework and requires no extra
parameters, preserving SD's generative strength while enabling it to render
fine-grained medical images with high fidelity. Comprehensive experiments
validate MedXChat's synergistic enhancement across all tasks. Our instruction
data and model will be open-sourced.

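MedXChat is a unified multimodal large model for seamless interaction between medical assistants and users. It covers three key functions: CXR-to-Report generation, CXR-based visual question answering, and Text-to-CXR synthesis. The model shows excellent cross-task adaptability in medical multimodal applications and outperforms the benchmark models on the MIMIC dataset. The work also introduces a Text-to-CXR synthesis method that uses the instruction-following capabilities within the Stable Diffusion (SD) architecture and requires no extra parameters, enabling the model to generate fine-grained medical images with high fidelity. Comprehensive experiments confirm MedXChat's synergistic enhancement across all tasks. The instruction data and model will be open-sourced.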