Semantic segmentation in surgical videos has applications in intra-operative
guidance, post-operative analytics and surgical education. Segmentation models
need to provide accurate and consistent predictions, since temporally
inconsistent identification of anatomical structures can impair usability and
compromise patient safety. Video information can alleviate these challenges,
leading to reliable models suitable for clinical use. We propose a novel architecture
for modelling temporal relationships in videos. The proposed model includes a
spatio-temporal decoder to enable video semantic segmentation by improving
temporal consistency across frames. The encoder processes individual frames
whilst the decoder processes a temporal batch of adjacent frames. The proposed
decoder can be used on top of any segmentation encoder to improve temporal
consistency. Model performance was evaluated on the CholecSeg8k dataset and a
private dataset of robotic partial nephrectomy procedures. On both datasets,
applying the temporal decoder improved segmentation performance. The proposed
model also displayed improvements in temporal consistency.
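
The data flow described above (per-frame encoding, then decoding over a window of adjacent frames) can be illustrated with a toy sketch. This is not the proposed decoder: the learned spatio-temporal decoder is replaced here by a plain moving average of class scores, and the function names, the `radius` parameter, and the label-flip count used as a consistency proxy are all assumptions made for illustration.

```python
def argmax_labels(scores):
    """Per-frame prediction: pick the highest-scoring class for every pixel."""
    return [[max(range(len(px)), key=px.__getitem__) for px in frame]
            for frame in scores]


def temporal_window_average(scores, radius=1):
    """Average class scores over adjacent frames.

    Hypothetical stand-in for a learned temporal decoder: each output frame
    sees the frames within `radius` steps of it (fewer at the sequence ends).
    `scores` is indexed as scores[frame][pixel][class].
    """
    num_frames = len(scores)
    smoothed = []
    for t in range(num_frames):
        window = scores[max(0, t - radius):min(num_frames, t + radius + 1)]
        frame = []
        for p in range(len(scores[t])):
            num_classes = len(scores[t][p])
            frame.append([sum(f[p][c] for f in window) / len(window)
                          for c in range(num_classes)])
        smoothed.append(frame)
    return smoothed


def label_flips(labels):
    """Count pixels whose label changes between consecutive frames."""
    return sum(a != b
               for prev, cur in zip(labels, labels[1:])
               for a, b in zip(prev, cur))


# One pixel, two classes, five frames; the middle frame is a noisy outlier.
scores = [[[0.9, 0.1]], [[0.8, 0.2]], [[0.4, 0.6]], [[0.9, 0.1]], [[0.8, 0.2]]]
per_frame = argmax_labels(scores)
smoothed = argmax_labels(temporal_window_average(scores, radius=1))
```

In this toy sequence the per-frame labels flicker at the outlier frame, while the window-averaged labels stay constant. A learned decoder would replace the fixed average with trainable temporal operations.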

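Semantic segmentation in surgical videos has applications in intra-operative guidance, post-operative analytics and surgical education. We propose a novel architecture for modelling temporal relationships in videos, which improves video semantic segmentation accuracy by improving temporal consistency across frames, and we validate the performance gains on two datasets.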

A spatio-temporal network for video semantic segmentation in surgical videos

We present a multimodal framework to learn general audio representations from
videos. Existing contrastive audio representation learning methods mainly focus
on using the audio modality alone during training. In this work, we show that
additional information contained in video can be utilized to greatly improve
the learned features. First, we demonstrate that our contrastive framework does
not require high resolution images to learn good audio features. This allows us
to scale up the training batch size, while keeping the computational load
incurred by the additional video modality to a reasonable level. Second, we use
augmentations that mix together different samples. We show that this
effectively makes the proxy task harder, which leads to substantial performance
improvements when increasing the batch size. As a result, our audio model
achieves a new state of the art of 42.4 mAP on the AudioSet classification
downstream task, closing the gap between supervised and self-supervised methods
trained on the same dataset. Moreover, we show that our method is advantageous
on a broad range of non-semantic audio tasks, including speaker identification,
keyword spotting, language identification, and music instrument classification.
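
To make the effect of sample mixing on a contrastive proxy task concrete, here is a minimal sketch under stated assumptions. The abstract does not specify the mixing scheme or the loss, so the sketch assumes a mixup-style convex combination and an InfoNCE-style audio-to-video loss. The names `mix_samples` and `info_nce`, the mixing weight `alpha` and the `temperature` are illustrative, and mixing is applied directly to toy embeddings because no encoder is modelled.

```python
import math


def mix_samples(x, y, alpha):
    """Mixup-style convex combination of two samples (assumed mixing scheme)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(x, y)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio_embs, video_embs, temperature=0.1):
    """Mean cross-entropy of matching audio i to video i within the batch."""
    losses = []
    for i, audio in enumerate(audio_embs):
        logits = [cosine(audio, video) / temperature for video in video_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two audio/video pairs with perfectly aligned embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 0.0], [0.0, 1.0]]

# Each audio sample is mixed with the other one in the batch.
mixed_audio = [mix_samples(audio[0], audio[1], 0.7),
               mix_samples(audio[1], audio[0], 0.7)]

clean_loss = info_nce(audio, video)
mixed_loss = info_nce(mixed_audio, video)
```

In this toy batch the mixed audio gives a higher loss than the clean audio, because each mixed sample is pulled toward the other pair. This matches the abstract's claim that mixing makes the proxy task harder.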

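By using a multimodal framework that exploits video information when training audio representations, together with data augmentation that mixes samples, the contrastive learning framework in this study achieves leading results on non-semantic audio tasks.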

Multimodal Self-Supervised Learning of General Audio Representations

We investigate video-aided grammar induction, which learns a constituency
parser from both unlabeled text and its corresponding video. Existing methods
of multi-modal grammar induction focus on learning syntactic grammars from
text-image pairs, with promising results showing that the information from
static images is useful in induction. However, videos provide even richer
information, including not only static objects but also actions and state
changes useful for inducing verb phrases. In this paper, we explore rich
features (e.g. action, object, scene, audio, face, OCR and speech) from videos,
taking the recent Compound PCFG model as the baseline. We further propose a
Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich
features from different modalities. Our proposed MMC-PCFG is trained end-to-end
and outperforms each individual modality and previous state-of-the-art systems
on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the
effectiveness of leveraging video information for unsupervised grammar
induction.

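This study explores grammar induction aided by video information: it extracts rich video features and uses the Multi-Modal Compound PCFG model (MMC-PCFG) for end-to-end unsupervised grammar induction, and experiments show that the model performs well on unlabeled text and video.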

Video-aided Unsupervised Grammar Induction

Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a practical, real-world
setting that has so far received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Unlike prior work on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), and that it drastically improves over other recently proposed
ASR-free KWS methods.

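This paper addresses visual keyword spotting for words unseen during training, a practical real-world setting. It uses an end-to-end multi-layer neural network architecture with a grapheme-to-phoneme encoder, and the model achieves very promising results on the LRS2 dataset.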