Transformer-based speech self-supervised learning (SSL) models, such as
HuBERT, show surprising performance in various speech processing tasks.
However, the huge number of parameters in speech SSL models necessitates
compression to a more compact model for wider usage in academia or small
companies. In this study, we propose reusing attention maps across the
Transformer layers, so as to remove key and query parameters while retaining
the number of layers. Furthermore, we propose a novel masking distillation
strategy to improve the student model's speech representation quality. We
extend the distillation loss to utilize both masked and unmasked speech frames
to fully leverage the teacher model's high-quality representation. Our
universal compression strategy yields a student model that achieves a phoneme
error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB
benchmark.
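
The attention map reuse described above can be illustrated with a minimal single-head sketch. It is not the authors' implementation: the helper names (`attention_map`, `apply_attention`, `qkv_weight_count`) are hypothetical, and the weight count assumes only square query, key, and value matrices with no biases or output projection. The first layer computes the attention map from its own query and key weights; the second layer applies that same map to its own value projection and therefore needs no query or key weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(x, w_q, w_k):
    """Scaled dot-product attention map softmax(Q K^T / sqrt(d))."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scale = math.sqrt(len(w_q[0]))
    return [softmax([s / scale for s in row]) for row in matmul(q, k_t)]

def apply_attention(attn, x, w_v):
    """Mix value vectors with a given (possibly reused) attention map."""
    return matmul(attn, matmul(x, w_v))

def qkv_weight_count(dim, reuse):
    """Assumed count: three dim x dim matrices, or only the value matrix when reusing."""
    return dim * dim if reuse else 3 * dim * dim

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three frames, two features
eye = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

attn = attention_map(x, eye, eye)          # layer 1 computes the map
h1 = apply_attention(attn, x, eye)
h2 = apply_attention(attn, h1, half)       # layer 2 reuses the map: no Q/K weights
```

Under these assumptions a reusing layer keeps one of the three projection matrices, which is where the parameter saving comes from while the layer count stays the same.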

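This study proposes a compression method for Transformer-based speech self-supervised learning models that reuses attention maps and adopts a novel distillation strategy. Our universal compression strategy achieves a phoneme error rate of 7.72% and a word error rate of 9.96% on the SUPERB benchmark.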
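
A distillation loss that uses both masked and unmasked frames could take the following shape. This is only a sketch under stated assumptions: the abstract does not give the loss formula, so the per-frame L1 distance, the `unmasked_weight` parameter, and the function names are all hypothetical choices made for illustration.

```python
def frame_l1(student_frame, teacher_frame):
    """Mean absolute difference between two feature vectors (assumed distance)."""
    return sum(abs(s - t) for s, t in zip(student_frame, teacher_frame)) / len(student_frame)

def mean(values):
    return sum(values) / len(values) if values else 0.0

def distill_loss(student, teacher, mask, unmasked_weight=1.0):
    """Sum of the average masked-frame distance and the weighted
    average unmasked-frame distance. mask[i] is True where frame i
    was masked in the student's input."""
    masked = [frame_l1(s, t) for s, t, m in zip(student, teacher, mask) if m]
    unmasked = [frame_l1(s, t) for s, t, m in zip(student, teacher, mask) if not m]
    return mean(masked) + unmasked_weight * mean(unmasked)

student = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher = [[0.0, 2.0], [2.0, 1.0], [2.0, 4.0]]
mask = [True, False, True]

loss_both = distill_loss(student, teacher, mask)                      # masked + unmasked
loss_masked = distill_loss(student, teacher, mask, unmasked_weight=0.0)  # masked only
```

Setting `unmasked_weight` to zero recovers a masked-only objective, which makes the contribution of the unmasked frames easy to isolate in this toy setting.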

Recycle-and-Distill: A Universal Compression Strategy for Transformer-based Speech SSL Models Based on Attention Map Reuse and Masking Distillation

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

Grapheme-to-phoneme conversion (g2p) is necessary for text-to-speech and
automatic speech recognition systems. Most g2p systems are monolingual: they
require language-specific data or handcrafting of rules. Such systems are
difficult to extend to low-resource languages, for which data and handcrafted
rules are not available. As an alternative, we present a neural
sequence-to-sequence approach to g2p which is trained on
spelling-pronunciation pairs in hundreds of languages. The system shares a
single encoder and decoder across all languages, allowing it to utilize the
intrinsic similarities between different writing systems. We show an 11%
improvement in phoneme error rate over an approach based on adapting
high-resource monolingual g2p models to low-resource languages. Our model is
also much more compact relative to previous approaches.
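
Phoneme error rate, the metric reported above, is commonly computed as the total edit distance between predicted and reference phoneme sequences divided by the total number of reference phonemes. The sketch below follows that common definition; the abstract does not spell out the exact scoring procedure, so the function names and the toy pronunciations are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total

# Toy reference and predicted pronunciations (hypothetical data).
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ae", "t"], ["d", "aa"]]
per = phoneme_error_rate(refs, hyps)  # one substitution + one deletion over 6 phonemes
```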

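This paper presents a neural sequence-to-sequence approach to grapheme-to-phoneme conversion that works across many languages. Compared with an approach that adapts high-resource monolingual models to low-resource languages, it achieves a markedly lower phoneme error rate, and the model is also more compact.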

Massively Multilingual Neural Phonemic Transcription

Massively Multilingual Neural Grapheme-to-Phoneme Conversion

Recurrent sequence generators conditioned on input data through an attention
mechanism have recently shown very good performance on a range of tasks,
including machine translation, handwriting synthesis and image caption
generation. We extend the attention mechanism with features needed for speech
recognition. We show that while an adaptation of the model used for machine
translation reaches a competitive 18.7% phoneme error rate (PER) on the
TIMIT phoneme recognition task, it can only be applied to utterances which are
roughly as long as the ones it was trained on. We offer a qualitative
explanation of this failure and propose a novel and generic method of adding
location-awareness to the attention mechanism to alleviate this issue. The new
method yields a model that is robust to long inputs and achieves 18% PER in
single utterances and 20% in 10-times longer (repeated) utterances. Finally, we
propose a change to the attention mechanism that prevents it from
concentrating too much on single frames, which further reduces PER to the
17.6% level.
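
One way to picture location-awareness is to let the attention scores depend on where the model attended at the previous step, for example through a 1-D convolution over the previous alignment. The sketch below is a heavily simplified scalar version under that assumption. The abstract does not give the formulas, so the additive scoring, the `loc_weight` parameter, the kernel, and the sigmoid-based smoothing option are illustrative choices rather than the paper's exact method.

```python
import math

def conv1d_same(signal, kernel):
    """Centered 1-D cross-correlation with zero padding (output length = input length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def location_aware_attention(content_scores, prev_alignment, kernel,
                             loc_weight=1.0, smooth=False):
    """Add location features derived from the previous alignment to the
    content scores, then normalize into an alignment over frames."""
    loc = conv1d_same(prev_alignment, kernel)
    scores = [c + loc_weight * f for c, f in zip(content_scores, loc)]
    if smooth:
        # Assumed smoothing: bounded sigmoid instead of exp, so one frame cannot dominate.
        return normalize([1.0 / (1.0 + math.exp(-s)) for s in scores])
    peak = max(scores)
    return normalize([math.exp(s - peak) for s in scores])

# A repeated input: frames 1 and 4 carry identical content scores.
content = [0.0, 2.0, 0.0, 0.0, 2.0, 0.0]
prev = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # previous step attended to frame 0
kernel = [3.0, 0.0, 0.0]                # favors the frame right after the previous focus

content_only = location_aware_attention(content, prev, kernel, loc_weight=0.0)
with_location = location_aware_attention(content, prev, kernel)
smoothed = location_aware_attention(content, prev, kernel, smooth=True)
```

In this toy case content-only scoring cannot tell the two identical frames apart, while the location term favors the one adjacent to the previous focus. The smoothed variant spreads the weight more evenly across frames.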

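This study proposes a model based on an improved attention mechanism with added location-awareness, which addresses the failure on long input utterances and effectively lowers the phoneme error rate.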