In the past year, multimodal large language models (MLLMs) have demonstrated
remarkable performance in tasks such as visual question answering, visual
understanding and reasoning. However, the extensive model size and high
training and inference costs have hindered the widespread appli