Visual question answering (VQA) in surgery is largely unexplored. Expert
surgeons are scarce and are often overloaded with clinical and academic
work. This overload limits the time they can spend answering questions from
patients, medical students or junior residents about surgical procedures.
At times, students and junior residents also refrain from asking too many
questions during classes to reduce disruption. While computer-aided simulators
and recordings of past surgical procedures have been made available for them to
observe and improve their skills, they still rely heavily on medical experts to
answer their questions. Having a Surgical-VQA system as a reliable 'second
opinion' could act as a backup and ease the load on the medical experts in
answering these questions. The lack of annotated medical data and the presence
of domain-specific terms have limited the exploration of VQA for surgical
procedures. In this work, we design a Surgical-VQA task that answers
questions about surgical procedures based on the surgical scene. Extending
the MICCAI Endoscopic Vision Challenge 2018 dataset and workflow recognition
dataset further, we introduce two Surgical-VQA datasets with classification and
sentence-based answers. To perform Surgical-VQA, we employ vision-text
transformer models. We further introduce a residual MLP-based VisualBert
encoder model that enforces interaction between visual and text tokens,
improving performance in classification-based answering. Furthermore, we study
the influence of the number of input image patches and temporal visual features
on the model performance in both classification and sentence-based answering.
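
The abstract does not spell out how the residual MLP enforces interaction between visual and text tokens. One plausible reading, given here only as a minimal sketch, is a token-mixing MLP applied across the concatenated visual and text tokens, with a skip connection around it. The function names, weight shapes and toy values below are assumptions for illustration and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU on a matrix stored as lists of rows."""
    return [[max(0.0, v) for v in row] for row in m]


def residual_token_mixing(tokens, w1, w2):
    """Hypothetical cross-token MLP with a residual connection.

    tokens: (n_tokens, dim) visual and text tokens stacked together
    w1:     (hidden, n_tokens) mixes information across tokens
    w2:     (n_tokens, hidden) projects back to one row per token
    """
    mixed = matmul(w2, relu(matmul(w1, tokens)))
    # Residual connection: every token keeps its own features and adds the mix.
    return [[t + m for t, m in zip(t_row, m_row)] for t_row, m_row in zip(tokens, mixed)]


# Toy input (assumed values): two visual tokens followed by one text token.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 2.0]]
tokens = visual_tokens + text_tokens

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
all_ones = [[1.0, 1.0, 1.0] for _ in range(3)]

mixed_tokens = residual_token_mixing(tokens, identity, all_ones)
print(mixed_tokens)
```

Because the mixing weights act along the token axis, the updated text token depends on both visual tokens, and each visual token depends on the text token. That is the kind of interaction the abstract describes, although the actual encoder may differ in its details.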

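We design a surgical question answering system based on medical images that uses vision-text transformer models, and we validate the proposed method on two Surgical-VQA datasets, with classification and sentence-based answers, for answering questions about surgical procedures.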

Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer

Our objective is language-based search of large-scale image and video
datasets. For this task, the approach that consists of independently mapping
text and vision to a joint embedding space, a.k.a. dual encoders, is attractive
because retrieval scales efficiently to billions of images using approximate
nearest neighbour search. An alternative approach of using vision-text
transformers with cross-attention gives considerable improvements in accuracy
over the joint embeddings, but is often inapplicable in practice for
large-scale retrieval given the cost of the cross-attention mechanisms required
for each sample at test time. This work combines the best of both worlds. We
make the following three contributions. First, we equip transformer-based
models with a new fine-grained cross-attention architecture, providing
significant improvements in retrieval accuracy whilst preserving scalability.
Second, we introduce a generic approach for combining a Fast dual encoder model
with our Slow but accurate transformer-based model via distillation and
re-ranking. Finally, we validate our approach on the Flickr30K image dataset
where we show an increase in inference speed of several orders of magnitude
while achieving results competitive with the state of the art. We also extend our
method to the video domain, improving the state of the art on the VATEX
dataset.
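
The re-ranking idea can be illustrated with a minimal sketch: a cheap dual-encoder similarity shortlists candidates, and an expensive scorer is applied only to that shortlist. The sketch uses brute-force cosine similarity rather than approximate nearest neighbour search, and the function names, the `slow_score` callable and the toy vectors and scores are assumptions for illustration, not the paper's implementation. Distillation is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_then_rerank(query_vec, gallery_vecs, slow_score, k):
    """Two-stage search: fast shortlist, then slow re-ranking.

    slow_score is a hypothetical callable mapping a gallery index to a
    relevance score, standing in for a cross-attention model.
    """
    # Stage 1 (fast): rank the whole gallery by embedding similarity.
    by_similarity = sorted(
        range(len(gallery_vecs)),
        key=lambda i: cosine(query_vec, gallery_vecs[i]),
        reverse=True,
    )
    shortlist = by_similarity[:k]
    # Stage 2 (slow): score only the k shortlisted items and reorder them.
    return sorted(shortlist, key=slow_score, reverse=True)


# Toy data (assumed values): precomputed gallery embeddings and slow scores.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
slow_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.1}
slow_calls = []


def slow_score(index):
    slow_calls.append(index)  # track how often the expensive model runs
    return slow_scores[index]


ranking = retrieve_then_rerank(query, gallery, slow_score, k=2)
print(ranking, len(slow_calls))
```

In this toy run the slow scorer is called on two items rather than four, which is where the speed-up in the abstract comes from at scale. The trade-off is that an item the fast stage misses (index 2 here) can never be recovered by re-ranking, however highly the slow model would have scored it.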

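This study addresses language-based search of large-scale image and video datasets. It considers two approaches: dual encoders, which independently map vision and text into a joint embedding space, and vision-text transformers with cross-attention. Combining the two improves retrieval accuracy while preserving scalability. The study also introduces a new fine-grained cross-attention architecture, combines a fast dual encoder model with a slow but accurate transformer model via distillation and re-ranking, and validates the method on the Flickr30K image dataset and the VATEX video dataset.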