Active Speaker Detection (ASD) aims to identify who is speaking in each frame
of a video. ASD reasons over audio and visual information in two contexts:
long-term intra-speaker context and short-term inter-speaker context. Long-term
intra-speaker context models the temporal dependencies of the same speaker,
while short-term inter-speaker context models the interactions of speakers in
the same scene. These two contexts are complementary to each other and can help
infer the active speaker. Motivated by these observations, we propose LoCoNet,
a simple yet effective Long-Short Context Network that models the long-term
intra-speaker context and short-term inter-speaker context. We use
self-attention to model long-term intra-speaker context due to its
effectiveness in modeling long-range dependencies, and convolutional blocks
that capture local patterns to model short-term inter-speaker context.
Extensive experiments show that LoCoNet achieves state-of-the-art performance
on multiple datasets, achieving an mAP of 95.2% (+1.1%) on AVA-ActiveSpeaker,
68.1% (+22%) on the Columbia dataset, 97.2% (+2.8%) on the Talkies dataset, and
59.7% (+8.0%) on the Ego4D dataset. Moreover, in challenging cases where multiple
speakers are present, or the face of the active speaker is much smaller than other
faces in the same scene, LoCoNet outperforms previous state-of-the-art methods
by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at
this https URL
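
The two contexts described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the attention has no learned projections, and a uniform averaging window stands in for the learned convolutional blocks. It assumes features arranged as speakers x frames x channels, and it applies the long-term step before the short-term step, which is also an assumption.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def long_term_intra_speaker(track):
    """Self-attention across ALL frames of one speaker; track is T x D."""
    dim = len(track[0])
    attended = []
    for query in track:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in track]
        weights = softmax(scores)
        attended.append([sum(w * frame[d] for w, frame in zip(weights, track))
                         for d in range(dim)])
    return attended


def short_term_inter_speaker(feats, kernel=3):
    """Local averaging over neighbouring speakers and frames; feats is S x T x D.

    A fixed uniform window is used here in place of learned convolution weights.
    """
    num_speakers, num_frames, dim = len(feats), len(feats[0]), len(feats[0][0])
    half = kernel // 2
    mixed = []
    for s in range(num_speakers):
        row = []
        for t in range(num_frames):
            window = [feats[r][u]
                      for r in range(max(0, s - half), min(num_speakers, s + half + 1))
                      for u in range(max(0, t - half), min(num_frames, t + half + 1))]
            row.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
        mixed.append(row)
    return mixed


def long_short_context(feats):
    # Long-term: each speaker attends over its own full track.
    per_speaker = [long_term_intra_speaker(track) for track in feats]
    # Short-term: mix information across speakers inside a small temporal window.
    return short_term_inter_speaker(per_speaker)


# Two speakers, three frames, one channel (toy values).
feats = [[[0.0], [0.0], [0.0]], [[3.0], [3.0], [3.0]]]
context = long_short_context(feats)
```

The attention step sees every frame of a single speaker, while the windowed step sees only a few adjacent frames but spans several speakers, which mirrors the long-term versus short-term split in the abstract.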

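This work proposes LoCoNet, a model that combines each speaker's long-term history with short-term interactions among speakers, modeling the two with self-attention and convolutional blocks respectively, and achieves state-of-the-art performance on multiple datasets.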

LoCoNet: Long-Short Context Network for Active Speaker Detection

Dense video captioning aims to localize and describe important events in
untrimmed videos. Existing methods mainly tackle this task by exploiting only
visual features, while completely neglecting the audio track. Only a few prior
works have utilized both modalities, yet they either show poor results or demonstrate
their importance on a dataset from a specific domain. In this paper, we introduce
the Bi-modal Transformer, which generalizes the Transformer architecture for a
bi-modal input. We show the effectiveness of the proposed model with audio and
visual modalities on the dense video captioning task, yet the module is capable
of digesting any two modalities in a sequence-to-sequence task. We also show
that the pre-trained bi-modal encoder as a part of the bi-modal transformer can
be used as a feature extractor for a simple proposal generation module. We
evaluate on the challenging ActivityNet Captions dataset, where our model
achieves outstanding performance. The code is available at
v-iashin.github.io/bmt
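
As a rough illustration of how an encoder can take two modalities at once, the sketch below lets each modality attend first to itself and then to the other. It is a simplified assumption rather than the released model: `attend` and `bimodal_layer` are hypothetical names, there are no learned projections, feed-forward blocks, or residual connections, and both modalities are assumed to share the same feature size.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[d] for e, v in zip(exps, values))
                    for d in range(len(values[0]))])
    return out


def bimodal_layer(audio, visual):
    # Each modality first attends within its own sequence.
    audio_self = attend(audio, audio, audio)
    visual_self = attend(visual, visual, visual)
    # Then each modality queries the other one (cross-modal attention).
    audio_out = attend(audio_self, visual_self, visual_self)
    visual_out = attend(visual_self, audio_self, audio_self)
    return audio_out, visual_out


# Toy inputs: four audio steps and two visual steps, two channels each.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
visual = [[2.0, 2.0], [2.0, 2.0]]
audio_out, visual_out = bimodal_layer(audio, visual)
```

Each output keeps the sequence length of its own modality, so the two streams need not be sampled at the same rate.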

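This paper presents a bi-modal encoder built on the Transformer architecture for dense video captioning. By processing video and audio inputs jointly, the model achieves strong performance on the ActivityNet Captions dataset.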