Self-supervised learning has been used to leverage unlabelled data, improving
accuracy and generalisation of speech systems through the training of
representation models. While many recent works have sought to produce effective
representations across a variety of acoustic domains, languages, modalities and
even simultaneous speakers, these studies have all been limited to
single-channel audio recordings. This paper presents Spatial HuBERT, a
self-supervised speech representation model that learns both acoustic and
spatial information pertaining to a single speaker in a potentially noisy
environment by using multi-channel audio inputs. Spatial HuBERT learns
representations that outperform state-of-the-art single-channel speech
representations on a variety of spatial downstream tasks, particularly in
reverberant and noisy environments. We also demonstrate the utility of the
representations learned by Spatial HuBERT on a speech localisation downstream
task. Along with this paper, we publicly release a new dataset of 100 000
simulated first-order ambisonics room impulse responses.

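Spatial HuBERT is a self-supervised speech representation model that uses multi-channel audio input to learn both acoustic and spatial information about a single speaker in a potentially noisy environment. It outperforms state-of-the-art single-channel speech representations on a variety of spatial downstream tasks, particularly in reverberant and noisy environments.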

Spatial HuBERT: Self-supervised Spatial Speech Representation Learning for a Single Talker from Multi-channel Audio

Supervised multi-channel audio source separation requires extracting useful
spectral, temporal, and spatial features from the mixed signals. The success of
many existing systems is therefore largely dependent on the choice of features
used for training. In this work, we introduce a novel multi-channel,
multi-resolution convolutional auto-encoder neural network that works on raw
time-domain signals to determine appropriate multi-resolution features for
separating the singing voice from stereo music. Our experimental results show
that the proposed method can achieve multi-channel audio source separation
without the need for hand-crafted features or any pre- or post-processing.
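
To make the idea of multi-resolution features on raw stereo audio concrete, here is a minimal sketch in plain Python. It is not the network from the paper: the kernels are fixed moving-average placeholders (a trained model would learn them), and the names `conv1d_same` and `multi_resolution_features` are hypothetical. It only illustrates filtering both channels at several kernel lengths and stacking the resulting feature maps.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output has len(signal) samples."""
    half = len(kernel) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            i = n + k - half
            if 0 <= i < len(signal):
                acc += w * signal[i]
        out.append(acc)
    return out


def multi_resolution_features(stereo, kernels):
    """Build one feature map per resolution from a stereo signal.

    stereo:  [left, right], two equal-length lists of samples.
    kernels: list of (left_kernel, right_kernel) pairs; each pair has its
             own length, i.e. its own temporal resolution (assumption:
             odd lengths so the window is centred on the sample).
    """
    features = []
    for k_left, k_right in kernels:
        left = conv1d_same(stereo[0], k_left)
        right = conv1d_same(stereo[1], k_right)
        # Each feature map mixes both channels, as a multi-channel filter does.
        features.append([l + r for l, r in zip(left, right)])
    return features


# Placeholder kernels: short (3 samples) and longer (5 samples) averages.
kernels = [([1 / 3] * 3, [1 / 3] * 3), ([0.2] * 5, [0.2] * 5)]
stereo = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]
features = multi_resolution_features(stereo, kernels)
```

The short kernel responds only near the impulse, while the longer kernel spreads the same energy over a wider window; stacking both gives the later layers access to more than one time scale.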

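This work proposes a multi-channel, multi-resolution convolutional auto-encoder neural network that operates on raw time-domain signals to determine multi-resolution features suited to separating the singing voice from stereo music. Experimental results show that the method achieves multi-channel audio source separation without hand-crafted features or any pre- or post-processing.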

Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders

Integration of multiple microphone data is one of the key ways to achieve
robust speech recognition in noisy environments or when the speaker is located
at some distance from the input device. Signal processing techniques such as
beamforming are widely used to extract a speech signal of interest from
background noise. These techniques, however, are highly dependent on prior
spatial information about the microphones and the environment in which the
system is being used. In this work, we present a neural attention network that
directly combines multi-channel audio to generate phonetic states without
requiring any prior knowledge of the microphone layout or any explicit signal
preprocessing for speech enhancement. We embed an attention mechanism within a
Recurrent Neural Network (RNN) based acoustic model to automatically tune its
attention to a more reliable input source. Unlike traditional multi-channel
preprocessing, our system can be optimized towards the desired output in one
step. Although attention-based models have recently achieved impressive results
on sequence-to-sequence learning, no attention mechanisms have previously been
applied to learn potentially asynchronous and non-stationary multiple inputs.
We evaluate our neural attention model on the CHiME-3 challenge task, and show
that the model achieves comparable performance to beamforming using a purely
data-driven method.
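
As a rough illustration of attention over microphone channels, the sketch below weights per-channel feature vectors for a single time step with a softmax and sums them. The linear scoring function, the `score_weights` vector, and the function names are assumptions made for this example; the abstract does not specify how the model computes its attention scores.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_channels(frames, score_weights):
    """Combine per-channel feature vectors for one time step.

    frames:        list of C feature vectors (one per microphone), length D each.
    score_weights: length-D vector of a hypothetical linear scoring function.
    Returns (attention weights, combined feature vector).
    """
    scores = [sum(w * x for w, x in zip(score_weights, frame)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    combined = [
        sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)
    ]
    return alphas, combined


# Three channels with two features each; the scorer favours the first feature.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, combined = attend_channels(frames, [2.0, 0.0])
```

Because the weights are recomputed for every time step, a channel that becomes noisy can be down-weighted without any knowledge of the microphone geometry, which is the property the abstract contrasts with beamforming.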

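This paper presents a neural attention network that directly combines multi-channel audio to generate phonetic states, without requiring any prior knowledge of the microphone layout or any explicit signal preprocessing for speech enhancement.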