Speech restoration aims to recover high-quality speech in the presence of a
diverse set of distortions. Although several deep learning paradigms have been
studied for this task, the power of the recently emerging language models has
not been fully explored. In this paper, we propose MaskSR, a masked language
model capable of restoring full-band 44.1 kHz speech while jointly considering noise,
reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens
extracted using a pre-trained neural codec. During training, MaskSR is
optimized to predict randomly masked tokens extracted from the high-quality
target speech, conditioned on the corrupted speech with various distortions.
During inference, MaskSR reconstructs the target speech tokens with efficient
iterative sampling. Extensive experiments show that MaskSR obtains competitive
results on both the full-band speech restoration task and its sub-tasks when
compared with a wide range of models.
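
The training and inference procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `MASK` id, the single flat token stream, the cosine unmasking schedule, and the `predict` callable (a stand-in for the model, which in practice would also be conditioned on the corrupted speech) are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical id reserved for the mask token


def mask_tokens(target, mask_ratio, rng):
    """Replace a random subset of target tokens with MASK.

    Returns the masked sequence and the positions the model must predict.
    """
    n_mask = max(1, round(mask_ratio * len(target)))
    positions = sorted(rng.sample(range(len(target)), n_mask))
    masked = list(target)
    for p in positions:
        masked[p] = MASK
    return masked, positions


def iterative_decode(predict, length, steps):
    """Fill an all-MASK sequence over a fixed number of steps.

    `predict(tokens)` is a stand-in for the model: it returns one
    (token, confidence) pair per position. At each step the most
    confident masked positions are committed; the rest stay masked.
    """
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule (an assumption): how many tokens stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        guesses = predict(tokens)
        ranked = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


# Toy usage: a fake "model" that always knows the target tokens.
target = [3, 1, 4, 1, 5, 9, 2, 6]


def toy_predict(tokens):
    return [(t, 1.0 / (i + 1)) for i, t in enumerate(target)]


masked, positions = mask_tokens(target, 0.5, random.Random(0))
decoded = iterative_decode(toy_predict, len(target), steps=4)
```

In this toy run the decoder commits only a few tokens in the early steps and the remainder in the last step, so the whole sequence is filled in four model calls rather than one call per token.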

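Speech restoration aims to recover high-quality speech in the presence of various distortions. This paper proposes MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech while jointly considering noise, reverb, clipping, and low bandwidth. MaskSR operates on discrete acoustic tokens extracted with a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens from the high-quality target speech, conditioned on corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR achieves competitive results on both the full-band speech restoration task and its sub-tasks compared with a wide range of models.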

MaskSR: Masked Language Model for Full-band Speech Restoration

Unsupervised discovery of acoustic tokens from audio corpora without
annotation and learning vector representations for these tokens have been
widely studied. Although these techniques have been shown to be successful in some
applications such as query-by-example Spoken Term Detection (STD), the lack of
mapping relationships between these discovered tokens and real phonemes has
limited the downstream applications. This paper represents probably the first
attempt towards the goal of completely unsupervised phoneme recognition, or
mapping audio signals to phoneme sequences without phoneme-labeled audio data.
The basic idea is to cluster the embedded acoustic tokens and learn the mapping
between the cluster sequences and the unknown phoneme sequences with a
Generative Adversarial Network (GAN). An unsupervised phoneme recognition
accuracy of 36% was achieved in the preliminary experiments.
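
The clustering step mentioned above can be sketched as follows. The abstract does not name a clustering algorithm, so plain k-means is used here purely as an assumption, and the 2-D vectors are made-up stand-ins for token embeddings. The GAN that maps cluster sequences to phoneme sequences is not sketched.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(c) / len(group) for c in zip(*group)]
    return centroids


def to_cluster_sequence(utterance, centroids):
    """Turn a sequence of token embeddings into a sequence of cluster indices."""
    return [nearest(vec, centroids) for vec in utterance]


# Toy usage: two utterances whose embeddings fall into two obvious groups.
utterances = [
    [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0)],
    [(4.9, 5.0), (0.1, 0.1)],
]
all_vectors = [v for u in utterances for v in u]
centroids = kmeans(all_vectors, k=2)
sequences = [to_cluster_sequence(u, centroids) for u in utterances]
```

The resulting index sequences are what a mapping model would consume. Which integer each cluster receives is arbitrary, so only the pattern of repeated indices carries information.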

This paper proposes a method for unsupervised phoneme recognition using a generative adversarial network, achieving an accuracy of 36%.