Purpose: General consensus amongst researchers and industry points to a lack
of large, representative annotated datasets as the biggest obstacle to progress
in the field of surgical data science. Self-supervised learning represents a
solution to part of this problem, removing the reliance on annotations.
However, the robustness of current self-supervised learning methods to domain
shifts remains unclear, limiting our understanding of its utility for
leveraging diverse sources of surgical data. Methods: In this work, we employ
self-supervised learning to flexibly leverage diverse surgical datasets,
thereby learning task-agnostic representations that can be used for various
surgical downstream tasks. Based on this approach, to elucidate the impact of
pre-training on downstream task performance, we explore 22 different
pre-training dataset combinations by modulating three variables: source
hospital, type of surgical procedure, and pre-training scale (number of
videos). We then fine-tune the resulting model initializations on three diverse
downstream tasks: namely, phase recognition and critical view of safety in
laparoscopic cholecystectomy and phase recognition in laparoscopic
hysterectomy. Results: Controlled experimentation highlights sizable boosts in
performance across various tasks, datasets, and labeling budgets. However, this
performance is intricately linked to the composition of the pre-training
dataset, a finding that holds consistently across several study stages. Conclusion: The
composition of pre-training datasets can severely affect the effectiveness of
SSL methods for various downstream tasks and should critically inform future
data collection efforts to scale the application of SSL methodologies.
Keywords: Self-Supervised Learning, Transfer Learning, Surgical Computer
Vision, Endoscopic Videos, Critical View of Safety, Phase Recognition

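By pre-training with self-supervised learning on different surgical datasets, this work flexibly leverages diverse surgical data to learn task-agnostic representations for various surgical downstream tasks. The study finds that the composition of the pre-training dataset severely affects the effectiveness of self-supervised learning methods on downstream tasks, and that this composition should be carefully considered when scaling the application of self-supervised learning methods.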

Jumpstarting Surgical Computer Vision

In surgical computer vision applications, obtaining labeled training data is
challenging due to data-privacy concerns and the need for expert annotation.
Unpaired image-to-image translation techniques have been explored to
automatically generate large annotated datasets by translating synthetic images
to the realistic domain. However, preserving the structure and semantic
consistency between the input and translated images presents significant
challenges, particularly when there is a distributional mismatch in the semantic
characteristics of the domains. This study empirically investigates unpaired
image translation methods for generating suitable data in surgical
applications, explicitly focusing on semantic consistency. We extensively
evaluate various state-of-the-art image translation models on two challenging
surgical datasets and downstream semantic segmentation tasks. We find that a
simple combination of structural-similarity loss and contrastive learning
yields the most promising results. Quantitatively, we show that the data
generated with this approach yields higher semantic consistency and can be used
more effectively as training data.
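
The abstract does not specify how the structural-similarity loss and the contrastive term are combined, so the following is only a minimal sketch under stated assumptions: SSIM is computed once over the whole image rather than with a sliding window, the contrastive term is a plain InfoNCE loss over feature rows, and the function names and the `ssim_weight` parameter are hypothetical, not taken from the paper.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM computed once over the whole image (assumption: no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def info_nce(queries, keys, temperature=0.07):
    """Contrastive loss: queries[i] should match keys[i] and no other row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def consistency_loss(src_img, out_img, src_feats, out_feats, ssim_weight=1.0):
    """Hypothetical combination: (1 - SSIM) plus a feature-level contrastive term."""
    structural = 1.0 - global_ssim(src_img, out_img)
    contrastive = info_nce(out_feats, src_feats)
    return ssim_weight * structural + contrastive
```

Here `src_img` and `out_img` stand for a synthetic input and its translated output, and `src_feats` and `out_feats` for features taken from corresponding locations in each. In a real training setup these would be differentiable tensors in a deep learning framework; NumPy is used only to keep the sketch self-contained.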

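This work explores the feasibility of using unpaired image translation techniques to generate semantically consistent data for surgical applications, and finds that a simple combination of structural-similarity loss and contrastive learning yields the most promising results. Quantitative analysis shows that data generated with this approach has higher semantic consistency and can be used more effectively as training data.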

Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications

Recent advancements in surgical computer vision applications have been driven
by fully-supervised methods, primarily using only visual data. These methods
rely on manually annotated surgical videos to predict a fixed set of object
categories, limiting their generalizability to unseen surgical procedures and
downstream tasks. In this work, we put forward the idea that the surgical video
lectures available through open surgical e-learning platforms can provide
effective supervisory signals for multi-modal representation learning without
relying on manual annotations. We address the surgery-specific linguistic
challenges present in surgical video lectures by employing multiple
complementary automatic speech recognition systems to generate text
transcriptions. We then present a novel method, SurgVLP - Surgical Vision
Language Pre-training, for multi-modal representation learning. SurgVLP
constructs a new contrastive learning objective to align video clip embeddings
with the corresponding multiple text embeddings by bringing them together
within a joint latent space. To effectively show the representation capability
of the learned joint latent space, we introduce several vision-and-language
tasks for surgery, such as text-based video retrieval, temporal activity
grounding, and video captioning, as benchmarks for evaluation. We further
demonstrate that without using any labeled ground truth, our approach can be
employed for traditional vision-only surgical downstream tasks, such as
surgical tool, phase, and triplet recognition. The code will be made available.
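
The abstract describes aligning each video clip embedding with several text embeddings in a joint latent space but does not give the exact loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss averaged over the text views (for example, one view per speech recognition system). The function names, the temperature value, and the averaging scheme are assumptions for illustration and should not be read as SurgVLP's actual objective.

```python
import numpy as np


def _l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video, text, temperature=0.07):
    """Row i of `video` and row i of `text` form the positive pair."""
    logits = _l2_normalize(video) @ _l2_normalize(text).T / temperature
    video_to_text = -np.mean(np.diag(_log_softmax(logits, axis=1)))
    text_to_video = -np.mean(np.diag(_log_softmax(logits, axis=0)))
    return 0.5 * (video_to_text + text_to_video)


def multi_text_contrastive_loss(video, texts, temperature=0.07):
    """Average the symmetric loss over all text views of the same clips."""
    losses = [symmetric_info_nce(video, t, temperature) for t in texts]
    return float(np.mean(losses))
```

Calling `multi_text_contrastive_loss(video, [text_a, text_b])` with arrays of shape `(batch, dim)` returns a scalar that is lower when each clip embedding lies closer to its own transcriptions than to those of other clips in the batch.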

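This study uses surgical video lectures for multi-modal representation learning, addresses the linguistic challenges of surgical videos through automatically generated text transcriptions, proposes SurgVLP, a new method for aligning video and text embeddings, and introduces several vision-and-language tasks for surgery as evaluation benchmarks.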