Multilingual speech recognition with supervised learning has achieved strong
results in recent research. With the development of pretraining methods on
audio and text data, it is imperative to transfer the knowledge from
unsupervised multilingual models to facilitate recognition, especially in many
languages with limited data. Our work investigates the effectiveness of using
two pretrained models for two modalities, wav2vec 2.0 for audio and MBART50 for
text, together with adaptive weight techniques to substantially improve
recognition quality on public datasets including CommonVoice and Europarl.
Overall, we observe a 44% improvement over purely supervised learning, and
more importantly, each technique provides a different reinforcement in
different languages. We also explore other possibilities to potentially obtain
the best model by slightly adding either depth or relative attention to the
architecture.

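Using the pretrained wav2vec 2.0 and MBART50 models together with adaptive weight techniques substantially improves multilingual speech recognition quality on public datasets, a 44% improvement over purely supervised learning. We also explore how slight changes to the architecture can yield the best model.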

Adaptive multilingual speech recognition using pretrained models

Adaptive multilingual speech recognition with pretrained models

In this work, we explore a multimodal semi-supervised learning approach for
punctuation prediction by learning representations from large amounts of
unlabelled audio and text data. Conventional approaches in speech processing
typically use forced alignment to encode per-frame acoustic features into
word-level features and perform multimodal fusion of the resulting acoustic and
lexical representations. As an alternative, we explore attention-based
multimodal fusion and compare its performance with forced-alignment-based
fusion. Experiments conducted on the Fisher corpus show that our proposed
approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the
baseline BLSTM model on reference transcripts and ASR outputs, respectively. We
further improve the model's robustness to ASR errors by performing data
augmentation with N-best lists, which achieves up to an additional ~2-6%
improvement on ASR outputs. We also demonstrate the effectiveness of the
semi-supervised learning approach by performing an ablation study on various
sizes of the corpus. When trained on 1 hour of speech and text data, the
proposed model achieves a ~9-18% absolute improvement over the baseline model.
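
To make the idea of attention-based fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes scaled dot-product attention, word and frame vectors of the same dimension, and fusion by concatenation. The abstract specifies none of these choices. The function name attention_fusion and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fusion(lexical, acoustic):
    """Fuse word-level and frame-level features without forced alignment.

    lexical:  list of word vectors, each of dimension d
    acoustic: list of frame vectors, each of dimension d
    Returns one fused vector of dimension 2 * d per word.

    Assumption: scaled dot-product attention with no learned projections.
    """
    d = len(lexical[0])
    fused = []
    for word in lexical:
        # Score this word against every acoustic frame.
        scores = [dot(word, frame) / math.sqrt(d) for frame in acoustic]
        weights = softmax(scores)
        # Soft alignment: attention-weighted average of all frames.
        context = [
            sum(w * frame[i] for w, frame in zip(weights, acoustic))
            for i in range(d)
        ]
        # Concatenate the lexical vector with its acoustic context.
        fused.append(word + context)
    return fused


# Toy input: two words, three acoustic frames, dimension 2.
lexical = [[1.0, 0.0], [0.0, 1.0]]
acoustic = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = attention_fusion(lexical, acoustic)
```

In this sketch each word receives a weighted mix of all frames instead of the frames assigned to it by a forced aligner, so no word-to-frame timestamps are needed. A trained model would normally add learned query, key, and value projections.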

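This work explores a multimodal semi-supervised learning approach that predicts punctuation by learning from large amounts of unlabelled audio and text data. It compares attention-based multimodal fusion with forced-alignment-based fusion. Experiments show that the proposed approach improves on the baseline model by about 6-9% and 3-4% absolute (F1 score) on reference transcripts and ASR outputs, respectively, and that data augmentation makes the model more robust to ASR errors.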

A multimodal semi-supervised learning framework for punctuation prediction in conversational speech

Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

Speech emotion recognition is a challenging task, and extensive reliance has
been placed on models that use audio features in building well-performing
classifiers. In this paper, we propose a novel deep dual recurrent encoder
model that utilizes text data and audio signals simultaneously to obtain a
better understanding of speech data. As emotional dialogue is composed of sound
and spoken content, our model encodes the information from audio and text
sequences using dual recurrent neural networks (RNNs) and then combines the
information from these sources to predict the emotion class. This architecture
analyzes speech data from the signal level to the language level, and it thus
utilizes the information within the data more comprehensively than models that
focus on audio features. Extensive experiments are conducted to investigate the
efficacy and properties of the proposed model. Our proposed model outperforms
previous state-of-the-art methods in assigning data to one of four emotion
categories (i.e., angry, happy, sad and neutral) when the model is applied to
the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
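
The dual-encoder idea described above can be sketched as follows. This is an illustrative toy, not the authors' model. It assumes plain tanh RNN cells, fusion by concatenating the two final hidden states, and untrained random weights. All class names, dimensions, and inputs are hypothetical.

```python
import math
import random

EMOTIONS = ["angry", "happy", "sad", "neutral"]


def rand_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class RNNEncoder:
    """Plain tanh RNN that returns its last hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.w_in = rand_matrix(hidden_dim, input_dim, rng)
        self.w_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            a = matvec(self.w_in, x)
            b = matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return h


class DualEncoderClassifier:
    """One encoder per modality; final states are concatenated and classified."""

    def __init__(self, audio_dim, text_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.audio_enc = RNNEncoder(audio_dim, hidden_dim, rng)
        self.text_enc = RNNEncoder(text_dim, hidden_dim, rng)
        self.w_out = rand_matrix(len(EMOTIONS), 2 * hidden_dim, rng)

    def predict_proba(self, audio_seq, text_seq):
        # Encode each modality separately, then combine by concatenation.
        joint = self.audio_enc.encode(audio_seq) + self.text_enc.encode(text_seq)
        return softmax(matvec(self.w_out, joint))


# Toy input: three audio frames (dimension 3) and two word vectors (dimension 2).
model = DualEncoderClassifier(audio_dim=3, text_dim=2, hidden_dim=4)
audio = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]]
text = [[1.0, 0.0], [0.0, 1.0]]
probs = model.predict_proba(audio, text)
label = EMOTIONS[probs.index(max(probs))]
```

Because the weights are random, the predicted label carries no meaning. The sketch only shows how two sequences of different lengths and feature sizes end up in a single four-class distribution.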

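This paper proposes a deep dual recurrent encoder model that uses both audio and text data for speech emotion recognition. The model outperforms previous methods: when applied to the IEMOCAP dataset, it assigns data to one of four emotion categories (angry, happy, sad and neutral) with accuracies ranging from 68.8% to 71.8%.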