Visual Question Answering (VQA) within the surgical domain, utilizing Large
Language Models (LLMs), offers a distinct opportunity to improve
intra-operative decision-making and facilitate intuitive surgeon-AI
interaction. However, the development of LLMs for surgical VQA is hindered by
the scarcity of diverse and extensive datasets with complex reasoning tasks.
Moreover, contextual fusion of the image and text modalities remains an open
research challenge due to the inherent differences between these two types of
information and the complexity involved in aligning them. This paper introduces
PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary
surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded
text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a
rich collection of question-answer pairs spanning crucial surgical aspects such
as phase and step recognition, context understanding, tool detection and
localization, and tool-tissue interactions. PitVQA-Net consists of a novel
image-grounded text embedding that projects image and text features into a
shared embedding space, and a GPT2 backbone with an excitation block
classification head to generate contextually relevant answers within the
complex domain of endonasal pituitary surgery. Our image-grounded text
embedding leverages joint embedding, cross-attention and contextual
representation to understand the contextual relationship between questions and
surgical images. We demonstrate the effectiveness of PitVQA-Net on both the
PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements
in balanced accuracy of 8% and 9% over the most recent baselines, respectively.
Our code and dataset are available at this https URL
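
Since the reported gains are in balanced accuracy, a minimal sketch of that metric may help. It uses the standard definition (the mean of per-class recall); the abstract does not spell out the exact evaluation protocol, and the function name and toy labels below are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true.

    Standard definition; the paper's exact evaluation protocol is an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per ground-truth class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy answers: a model that always predicts the
# majority class looks good on plain accuracy but not on balanced accuracy.
y_true = ["forceps"] * 8 + ["suction"] * 2
y_pred = ["forceps"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, balanced accuracy: {balanced:.2f}")
```

Balanced accuracy weights every answer class equally, which matters when some answers (for example, a frequently used instrument) dominate the question-answer pairs.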

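This paper proposes PitVQA and PitVQA-Net, which use joint embedding and contextual representation of image and text information to address complex question-answering tasks in endoscopic pituitary surgery, and achieve significant performance improvements on the PitVQA and EndoVis18-VQA datasets.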
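
To make the idea of grounding text in image features more concrete, here is a minimal sketch of scaled dot-product cross-attention in which question tokens act as queries and image patches act as keys and values. It is not the PitVQA-Net implementation: the function names are hypothetical, the learned projection matrices are omitted, and the residual addition is an assumed way of fusing the attended image context back into the text embedding.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(text_tokens, image_patches):
    """Ground each text token in the image patches.

    text_tokens:   list of d-dimensional vectors, used as queries.
    image_patches: list of d-dimensional vectors, used as keys and values.
    Returns one d-dimensional vector per text token.

    Assumption: both inputs already live in a shared d-dimensional space;
    a real model would apply learned query/key/value projections first.
    """
    d = len(text_tokens[0])
    scale = math.sqrt(d)
    grounded = []
    for q in text_tokens:
        # Similarity of this text token to every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in image_patches]
        weights = softmax(scores)
        # Attention-weighted average of the image patches.
        context = [sum(w * v[j] for w, v in zip(weights, image_patches))
                   for j in range(d)]
        # Residual fusion (an assumed design choice): text plus image context.
        grounded.append([qi + ci for qi, ci in zip(q, context)])
    return grounded


# Hypothetical toy inputs: 2 question tokens, 3 image patches, d = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = cross_attention(text, patches)
print(len(fused), len(fused[0]))
```

In a full model, the fused token vectors would then be passed to the language-model backbone in place of the plain text embeddings.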