Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and significant advancement in healthcare. It assists medical experts to swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamlining data preparation and build new benchmark MedVQA datasets R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework which finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

通过设计半自动注释过程，构建了基于多模态大型语言模型生成中间医疗决策理由的新的基准MedVQA数据集R-RAD和R-SLAKE，并将其纳入训练过程中，通过三种不同的策略生成决策结果和相应的理由，从而清楚地展示推理过程中的医疗决策过程，实验证明该方法在R-RAD上能达到83.5%的准确率，在R-SLAKE上能达到86.3%的准确率，显著优于现有最先进的基线模型。

MedThink：通过多模态决策理由解释医学视觉问题回答