This paper presents a novel framework for joint speaker diarization (SD) and
automatic speech recognition (ASR), named SLIDAR (sliding-window
diarization-augmented recognition). SLIDAR can process arbitrary-length inputs
and can handle any number of speakers, effectively solving ``who spoke what,
when'' concurrently. SLIDAR leverages a sliding-window approach and consists of
an end-to-end diarization-augmented speech transcription (E2E DAST) model which
provides, locally, for each window: transcripts, diarization and speaker
embeddings. The E2E DAST model is based on an encoder-decoder architecture and
leverages recent techniques such as serialized output training and
``Whisper-style'' prompting. The local outputs are then combined into the
final SD+ASR result by clustering the speaker embeddings to obtain global speaker
identities. Experiments performed on monaural recordings from the AMI corpus
confirm the effectiveness of the method in both close-talk and far-field speech
scenarios.
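
The sliding-window and clustering steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the window and hop lengths, the similarity threshold, and the greedy cosine clustering are all assumptions chosen for illustration, and the local speaker embeddings are assumed to be non-zero vectors already produced by some per-window model.

```python
import numpy as np


def sliding_windows(total_sec, win_sec, hop_sec):
    """Return (start, end) pairs covering [0, total_sec].

    Assumes 0 < hop_sec <= win_sec so that consecutive windows overlap
    or touch. All three values are illustrative, not taken from the paper.
    """
    windows = []
    start = 0.0
    while True:
        end = min(start + win_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break
        start += hop_sec
    return windows


def cluster_embeddings(embeddings, threshold=0.7):
    """Map local speaker embeddings to global speaker labels.

    Stand-in technique (an assumption): greedy cosine clustering. Each
    embedding joins the most similar existing cluster if the cosine
    similarity to its centroid is >= threshold; otherwise it starts a
    new cluster. Embeddings must be non-zero vectors.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    centroid_sums = []  # running sum of unit vectors per cluster
    labels = []
    for v in x:
        if centroid_sums:
            sims = [float(v @ (c / np.linalg.norm(c))) for c in centroid_sums]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                centroid_sums[best] = centroid_sums[best] + v
                labels.append(best)
                continue
        centroid_sums.append(v.copy())
        labels.append(len(centroid_sums) - 1)
    return labels


# Hypothetical usage: a 10 s recording, 4 s windows, 2 s hop, and four
# made-up 2-D local embeddings collected across the windows.
windows = sliding_windows(10, 4, 2)
global_labels = cluster_embeddings([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

In this sketch, local speakers from different windows receive the same global label when their embeddings are close in cosine similarity, which is the role the clustering step plays in combining the per-window outputs.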

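This paper proposes SLIDAR (sliding-window diarization-augmented recognition), a novel framework for joint speaker diarization and automatic speech recognition. It handles inputs of arbitrary length and any number of speakers: a sliding-window approach yields transcripts, diarization and speaker embeddings for each window, and clustering the speaker embeddings gives global speaker identities. Experiments confirm the effectiveness of the method in both close-talk and far-field speech scenarios.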