While the direction of arrival (DOA) of sound events is generally estimated from
multichannel audio data recorded with a microphone array, sound events usually
derive from visually perceptible source objects, e.g., sounds of footsteps come
from the feet of a walker. This paper proposes an audio-visual sound event
localization and detection (SELD) task, which uses multichannel audio and video
information to estimate the temporal activation and DOA of target sound events.
Audio-visual SELD systems can detect and localize sound events using signals
from a microphone array and audio-visual correspondence. We also introduce an
audio-visual dataset, Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23),
which consists of multichannel audio data recorded with a microphone array,
video data, and spatiotemporal annotation of sound events. Sound scenes in
STARSS23 are recorded with instructions, which guide recording participants to
ensure adequate activity and occurrences of sound events. STARSS23 also provides
human-annotated temporal activation labels and human-confirmed DOA labels,
which are based on tracking results of a motion capture system. Our benchmark
results show that the audio-visual SELD system achieves lower localization
error than the audio-only system. The data is available at
this https URL
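
The localization error mentioned above is commonly reported as the angle between a reference DOA and an estimated DOA. The following is a minimal sketch of that angular distance, not necessarily the exact metric used in the benchmark. The function names and the azimuth/elevation convention (degrees, elevation measured from the horizontal plane) are assumptions made for illustration.

```python
import math


def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert a DOA given as (azimuth, elevation) in degrees to a 3-D unit vector.

    Assumed convention: elevation is measured from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_error_deg(reference, estimate):
    """Angle in degrees between two DOAs, each given as (azimuth, elevation)."""
    u = doa_to_unit_vector(*reference)
    v = doa_to_unit_vector(*estimate)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))
```

For example, `angular_error_deg((0, 0), (90, 0))` is 90 degrees up to floating-point rounding, and identical directions give 0.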

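This paper proposes an audio-visual sound event localization and detection (SELD) task, which uses multichannel audio and video information to estimate the temporal activation and DOA of target sound events. Audio-visual SELD systems can detect and localize sound events using microphone-array signals and audio-visual correspondence. The paper also introduces an audio-visual dataset of multichannel audio recordings, made with instructions that ensure adequate participant activity and sound event occurrences.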

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party
Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several
approaches to empower the clustering-based speaker diarization system to handle
overlapped speech. Front-end dereverberation and the direction-of-arrival (DOA)
estimation are used to improve the accuracy of speaker diarization.
Multi-channel combination and overlap detection are applied to reduce the
missed speaker error. A modified DOVER-Lap is also proposed to fuse the results
of different systems. We achieve a final diarization error rate (DER) of 5.79% on the Eval set and
7.23% on the Test set. For Track 2, we develop our system using the Conformer
model in a joint CTC-attention architecture. Serialized output training is
adopted for multi-speaker overlapped speech recognition. We propose a neural
front-end module to model multi-channel audio and train the model end-to-end.
Various data augmentation methods are utilized to mitigate over-fitting in the
multi-channel multi-speaker E2E system. Transformer language model fusion is
developed to achieve better performance. The final character error rate (CER) is 19.2% on the Eval set
and 20.8% on the Test set.
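
For readers unfamiliar with the two reported metrics, the sketch below shows their standard textbook definitions: DER as the sum of missed speech, false alarm, and speaker confusion time divided by total reference speech time, and CER as the character-level edit distance divided by the reference length. This is an illustration only. The challenge's actual scoring details (for example collars or the handling of multi-speaker references) are not stated here, and all function names are hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in the same unit (e.g., seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.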

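This paper presents our submission to the ICASSP 2022 M2MeT Challenge. For Track 1, we propose several methods that strengthen a clustering-based speaker diarization system so that it can handle overlapped speech. For Track 2, we train a system built on a Conformer model and a neural front-end module for multi-channel overlapped speech recognition, reaching a final CER of 19.2% on the Eval set and 20.8% on the Test set.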

The Volcspeech system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

The incoherence between measurement and sparsifying transform matrices and
the restricted isometry property (RIP) of the measurement matrix are two of the key
factors in determining the performance of compressive sensing (CS). In CS-MRI,
the randomly under-sampled Fourier matrix is used as the measurement matrix and
the wavelet transform is usually used as the sparsifying transform matrix. However,
the incoherence between the randomly under-sampled Fourier matrix and the
wavelet matrix is not optimal, which can deteriorate the performance of CS-MRI.
Using the mathematical result that noiselets are maximally incoherent with
wavelets, this paper introduces the noiselet unitary bases as the measurement
matrix to improve the incoherence and RIP in CS-MRI, and presents a method to
design the pulse sequence for the noiselet encoding. This novel encoding scheme
is combined with the multichannel compressive sensing (MCS) framework to take
advantage of the multichannel data acquisition used in MRI scanners. An
empirical RIP analysis is presented to compare the multichannel noiselet and
multichannel Fourier measurement matrices in MCS. Simulations are presented in
the MCS framework to compare the performance of noiselet encoding
reconstructions and Fourier encoding reconstructions at different acceleration
factors. The comparisons indicate that the multichannel noiselet measurement matrix
has a better RIP than its Fourier counterpart, and that noiselet encoded
MCS-MRI outperforms Fourier encoded MCS-MRI in preserving image resolution and
can achieve higher acceleration factors. To demonstrate the feasibility of the
proposed noiselet encoding scheme, two pulse sequences with tailored spatially
selective RF excitation pulses were designed and implemented on a 3T scanner to
acquire the data in the noiselet domain from a phantom and a human brain.
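
The incoherence argument above can be made concrete with the usual mutual coherence measure, mu = sqrt(n) * max |<phi_k, psi_j>|, which ranges from 1 (maximally incoherent) to sqrt(n) for two orthonormal bases. The sketch below is an illustration under simplifying assumptions, not the paper's analysis: it uses the full unitary DFT matrix rather than a randomly under-sampled one, takes Haar as the wavelet, and does not construct noiselets. All function names are hypothetical.

```python
import cmath
import math


def dft_matrix(n):
    """Rows of the unitary n-point DFT matrix."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)]
            for k in range(n)]


def identity_matrix(n):
    """Rows of the spike (identity) basis."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def haar_matrix(n):
    """Rows of the orthonormal Haar wavelet basis; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1.0]]
    s = 1.0 / math.sqrt(2.0)
    while len(rows) < n:
        m = len(rows)
        # Coarse part: each existing row stretched by two (Kronecker with [1, 1]).
        top = [[s * v for v in row for _ in (0, 1)] for row in rows]
        # Detail part: finest-scale wavelets (Kronecker of identity with [1, -1]).
        bottom = [[s * (1 if c == 2 * i else -1 if c == 2 * i + 1 else 0)
                   for c in range(2 * m)] for i in range(m)]
        rows = top + bottom
    return rows


def mutual_coherence(measurement_rows, sparsifying_rows):
    """sqrt(n) times the largest absolute inner product between the two bases."""
    n = len(measurement_rows)
    largest = max(
        abs(sum(a.conjugate() * b for a, b in zip(phi, psi)))
        for phi in measurement_rows
        for psi in sparsifying_rows
    )
    return math.sqrt(n) * largest
```

With n = 8, the Fourier/Haar pair reaches the upper bound sqrt(8), because the constant Haar scaling vector coincides with the zero-frequency Fourier row, whereas the Fourier/identity pair attains the minimum value of 1. This is one simple way to see why low-frequency Fourier samples are poorly matched to a wavelet sparsifying basis.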

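This paper applies a noiselet encoding scheme to multichannel MRI data acquisition. Building on the mathematical result that noiselets are maximally incoherent with wavelets, it uses noiselet bases as the measurement matrix to improve on conventional encoding in both incoherence with the sparsifying transform matrix and RIP, and its experiments show gains in image reconstruction quality and achievable acceleration factor.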