We propose EmoDistill, a novel speech emotion recognition (SER) framework
that leverages cross-modal knowledge distillation during training to learn
strong linguistic and prosodic representations of emotion from speech. During
inference, our method only uses a stream of speech signals to perform unimodal
SER thus reducing computation overhead and avoiding run-time transcription and
prosodic feature extraction errors. During training, our method distills
information at both embedding and logit levels from a pair of pre-trained
Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on
the IEMOCAP benchmark demonstrate that our method outperforms other unimodal
and multimodal techniques by a considerable margin, and achieves
state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted
accuracy. Detailed ablation studies demonstrate the impact of each component of
our method.

EmoDistill 是一个新颖的语音情感识别（SER）框架，利用跨模态知识蒸馏在训练期间从语音中学习强大的语言和韵律情感表示。在推断过程中，我们的方法仅使用一系列语音信号执行单模态 SER，从而减少计算开销并避免运行时转录和韵律特征提取错误。在 IEMOCAP 基准上的实验证明，我们的方法以相当大的优势胜过其他单模态和多模态技术，并实现了 77.49％的非加权准确率和 78.91％的加权准确率。详细的消融研究展示了我们方法的每个组成部分的影响。

通过提炼韵律和语言情感表达的语音情感识别

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect  Representations

A large part of the expressive speech synthesis literature focuses on
learning prosodic representations of the speech signal which are then modeled
by a prior distribution during inference. In this paper, we compare different
prior architectures at the task of predicting phoneme level prosodic
representations extracted with an unsupervised FVAE model. We use both
subjective and objective metrics to show that normalizing flow based prior
networks can result in more expressive speech at the cost of a slight drop in
quality. Furthermore, we show that the synthesized speech has higher
variability, for a given text, due to the nature of normalizing flows. We also
propose a Dynamical VAE model, that can generate higher quality speech although
with decreased expressiveness and variability compared to the flow based
models.

本文比较了不同架构（prior architectures）在预测从 FVAE 模型中提取的音素级韵律表示方面的表现，并使用主观和客观指标证明了基于正规化流的先验网络可以在表现力方面产生更加生动的语音，并提出了一个动态 VAE 模型与基于流的模型相比，尽管在表现力和变异性上有所减少，但可以产生更高质量的语音。

使用 AR 和基于流的先验网络预测音素级韵律潜变量用于表现力语音合成

Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis

This paper proposes an Expressive Speech Synthesis model that utilizes
token-level latent prosodic variables in order to capture and control
utterance-level attributes, such as character acting voice and speaking style.
Current works aim to explicitly factorize such fine-grained and utterance-level
speech attributes into different representations extracted by modules that
operate in the corresponding level. We show that the fine-grained latent space
also captures coarse-grained information, which is more evident as the
dimension of latent space increases in order to capture diverse prosodic
representations. Therefore, a trade-off arises between the diversity of the
token-level and utterance-level representations and their disentanglement. We
alleviate this issue by first capturing rich speech attributes into a
token-level latent space and then, separately train a prior network that given
the input text, learns utterance-level representations in order to predict the
phoneme-level, posterior latents extracted during the previous step. Both
qualitative and quantitative evaluations are used to demonstrate the
effectiveness of the proposed approach. Audio samples are available in our demo
page.

本论文提出了一种表达性语音合成模型，该模型利用标记级别的潜在韵律变量来捕捉和控制话语级别属性，如角色配音和说话风格，其中的潜在细节级别空间同时也捕捉更粗粒度的信息。