In healthcare, multimodal data is prevalent and requires to be
comprehensively analyzed before diagnostic decisions, including medical images,
clinical reports, etc. However, current large-scale artificial intelligence
models predominantly focus on single-modal cognitive abilities and