Visual Question Answering (VQA) in the medical domain is an interdisciplinary challenge that combines Computer Vision, Natural Language Processing, and Knowledge Representation. Despite its importance, research on medical VQA has been scant and has only gained momentum since 2018. To address this gap, we study the effective representation of radiology images and the joint learning of multimodal representations, improving on existing methods. We also augment the SLAKE dataset so that our model can answer a more diverse range of questions, not only those about the immediate content of radiology or pathology images. Our model achieves a top-1 accuracy of 79.55\% with a less complex architecture, which is comparable to current state-of-the-art models. This work advances medical VQA and opens avenues for practical applications in diagnostic settings.

Free-Form Question Answering on Medical Images in Radiology