This report introduces our novel method named STHG for the Audio-Visual
Diarization task of the Ego4D Challenge 2023. Our key innovation is that we
model all the speakers in a video using a single, unified heterogeneous graph
learning framework. Unlike previous approaches that require a separate
component solely for the camera wearer, STHG can jointly detect the speech
activities of all people, including the camera wearer. Our final method obtains
a diarization error rate (DER) of 61.1% on the Ego4D test set, which
significantly outperforms all the baselines as well as last year's winner. Our
submission achieved 1st place in
the Ego4D Challenge 2023. We additionally demonstrate that applying an
off-the-shelf speech recognition system to the speech segments diarized by STHG
yields competitive performance on the Speech Transcription task of this
challenge.
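
For readers unfamiliar with the metric, the sketch below shows how DER is conventionally computed: the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. Lower is better. The function name and the durations are hypothetical and chosen only for illustration; they are not taken from the Ego4D evaluation.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of the total reference speech time.

    All arguments are durations in seconds:
      missed       - reference speech the system did not detect
      false_alarm  - non-speech the system labeled as speech
      confusion    - speech attributed to the wrong speaker
      total_speech - total reference speech time
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, for illustration only.
der = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=10.0, total_speech=100.0
)
print(f"DER = {der:.1%}")
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% on difficult recordings.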

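This report presents our new method, STHG, which models all speakers in a video with a unified heterogeneous graph learning framework. Applied to the Audio-Visual Diarization task of the Ego4D Challenge 2023, it achieves 61.1% DER and took first place in the challenge. It also yields strong results when applied to the Speech Transcription task.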

Advanced Audio-Visual Diarization Based on Spatial-Temporal Heterogeneous Graph Learning

STHG: Spatial-Temporal Heterogeneous Graph Learning for Advanced Audio-Visual Diarization

TV subtitles are a rich source of transcriptions of many types of speech,
ranging from read speech in news reports to conversational and spontaneous
speech in talk shows and soaps. However, subtitles are not verbatim (i.e.
exact) transcriptions of speech, so they cannot be used directly to improve an
Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder
Transformer model that jointly performs ASR and automatic subtitling. The ASR
decoder (possibly pre-trained) predicts the verbatim output and the subtitle
decoder generates a subtitle, while sharing the encoder. The two decoders can
be independent or connected. The model is trained to perform both tasks
jointly, and is able to effectively use subtitle data. We show improvements on
regular ASR and on spontaneous and conversational ASR by incorporating the
additional subtitle decoder. The method does not require preprocessing
(aligning, filtering, pseudo-labeling, ...) of the subtitles.
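
The shared-encoder, dual-decoder layout described above can be sketched roughly as follows. This is a minimal illustration of the independent-decoder variant, not the authors' implementation: the class name `DualDecoderASR`, the layer sizes, the shared token embedding, and the loss weight `sub_weight` are all assumptions, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn


class DualDecoderASR(nn.Module):
    """Hypothetical sketch: one shared encoder feeding two independent decoders."""

    def __init__(self, feat_dim=80, vocab_size=100, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        self.embed = nn.Embedding(vocab_size, d_model)  # shared embedding (assumption)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.asr_decoder = self._make_decoder(d_model, nhead, num_layers)
        self.sub_decoder = self._make_decoder(d_model, nhead, num_layers)
        self.asr_out = nn.Linear(d_model, vocab_size)
        self.sub_out = nn.Linear(d_model, vocab_size)

    @staticmethod
    def _make_decoder(d_model, nhead, num_layers):
        layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True
        )
        return nn.TransformerDecoder(layer, num_layers)

    def _decode(self, decoder, head, tokens, memory):
        # Causal mask: each position may attend only to itself and earlier tokens.
        length = tokens.size(1)
        mask = torch.triu(
            torch.full((length, length), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        return head(decoder(self.embed(tokens), memory, tgt_mask=mask))

    def forward(self, feats, asr_in, sub_in):
        # The audio is encoded once; both decoders attend to the same memory.
        memory = self.encoder(self.input_proj(feats))
        asr_logits = self._decode(self.asr_decoder, self.asr_out, asr_in, memory)
        sub_logits = self._decode(self.sub_decoder, self.sub_out, sub_in, memory)
        return asr_logits, sub_logits


def joint_loss(asr_logits, sub_logits, asr_tgt, sub_tgt, sub_weight=0.5):
    """Weighted sum of the two cross-entropy losses (the weighting is an assumption)."""
    ce = nn.CrossEntropyLoss()
    asr_loss = ce(asr_logits.reshape(-1, asr_logits.size(-1)), asr_tgt.reshape(-1))
    sub_loss = ce(sub_logits.reshape(-1, sub_logits.size(-1)), sub_tgt.reshape(-1))
    return asr_loss + sub_weight * sub_loss


# Toy batch: 2 utterances, 50 frames of 80-dimensional features, random token ids.
model = DualDecoderASR()
feats = torch.randn(2, 50, 80)
asr_tokens = torch.randint(0, 100, (2, 12))
sub_tokens = torch.randint(0, 100, (2, 9))

# Teacher forcing: feed tokens[:-1] and predict tokens[1:].
asr_logits, sub_logits = model(feats, asr_tokens[:, :-1], sub_tokens[:, :-1])
loss = joint_loss(asr_logits, sub_logits, asr_tokens[:, 1:], sub_tokens[:, 1:])
loss.backward()
```

In this sketch the subtitle loss backpropagates into the shared encoder, which is one way subtitle data could inform the ASR branch without any alignment or filtering step. The connected-decoder variant mentioned in the abstract is not shown.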

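This work proposes a multitask dual-decoder Transformer model that uses TV subtitle data for speech recognition and automatic subtitle generation. With a shared encoder, the model jointly predicts the verbatim transcription and generates subtitles, requires no preprocessing of the subtitles, and improves ASR performance.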