Recent years have seen remarkable progress in speech emotion recognition
(SER), thanks to advances in deep learning techniques. However, the limited
availability of labeled data remains a significant challenge in the field.
Self-supervised learning has recently emerged as a promising solution to
address this challenge. In this paper, we propose the vector quantized masked
autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned
to recognize emotions from speech signals. The VQ-MAE-S model is based on a
masked autoencoder (MAE) that operates in the discrete latent space of a
vector-quantized variational autoencoder. Experimental results show that the
proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on
emotional speech data, outperforms an MAE working on the raw spectrogram
representation and other state-of-the-art methods in SER.
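
The two ideas named in the abstract, quantizing speech frames to discrete codebook indices and then masking some of those indices for reconstruction, can be illustrated with a minimal sketch. The abstract does not describe the actual architecture, so everything here is an assumption made for illustration: the toy codebook, the 2-dimensional frames, the mask ratio, the `mask_id` value, and the function names `quantize` and `mask_tokens`. It is not the authors' implementation.

```python
import random

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in vector quantization."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]

def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with mask_id.
    Returns the masked sequence and the sorted masked positions."""
    n_mask = int(round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = mask_id
    return masked, positions

# Hypothetical toy data: a 3-entry codebook and four 2-D "frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
masked, positions = mask_tokens(
    tokens, mask_ratio=0.5, mask_id=-1, rng=random.Random(0)
)
```

In a masked-autoencoder setup, a model would then be trained to predict the original values of `tokens` at `positions` given only `masked`; that training step is omitted here.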

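This paper introduces VQ-MAE-S, a vector-quantized masked autoencoder trained with self-supervised learning. The model builds on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder, and it recognizes emotions from speech signals. Pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, the model outperforms an MAE that uses raw spectrograms, as well as other state-of-the-art methods, in speech emotion recognition.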

A vector quantized masked autoencoder for speech emotion recognition

In this paper, we present a multi-modal online person verification system
using both speech and visual signals. Inspired by neuroscientific findings on
the association of voice and face, we propose an attention-based end-to-end
neural network that learns multi-sensory associations for the task of person
verification. The attention mechanism in our proposed network learns to
conditionally select a salient modality between speech and facial
representations that provides a balance between complementary inputs. By virtue
of this capability, the network is robust to missing or corrupted data from
either modality. On the VoxCeleb2 dataset, we show that our method performs
favorably against competing multi-modal methods. Even for extreme cases of
large corruption or an entirely missing modality, our method demonstrates
robustness over other unimodal methods.
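
The attention-based selection between modalities described above can be sketched as a softmax-weighted combination of the two embeddings. This is a minimal illustration under stated assumptions, not the paper's network: the function names `softmax` and `fuse` are hypothetical, the embeddings are toy 3-dimensional vectors, and the per-modality scores are passed in directly, whereas in a trained network they would be produced by learned layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(speech_emb, face_emb, speech_score, face_score):
    """Combine two embeddings using attention weights derived from
    per-modality scores (assumed here to be given, not learned)."""
    w_speech, w_face = softmax([speech_score, face_score])
    fused = [w_speech * s + w_face * f for s, f in zip(speech_emb, face_emb)]
    return fused, (w_speech, w_face)

# Hypothetical case: the speech input is corrupted, so it receives a low score
# and the fused representation is dominated by the face embedding.
speech_emb = [0.0, 0.0, 0.0]
face_emb = [0.5, -0.2, 0.8]
fused, (w_speech, w_face) = fuse(speech_emb, face_emb, speech_score=-5.0, face_score=2.0)
```

Because the weights always sum to one, a modality with a very low score contributes almost nothing to the fused vector, which is one simple way a model can remain usable when one input is missing or corrupted.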

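Using a multi-modal approach that combines speech and image signals, this work proposes a neural-network-based online person verification system. The network learns multi-sensory associations for the verification task and uses an attention mechanism to select the salient modality, so that the two inputs complement each other. On the VoxCeleb2 dataset, the method shows better robustness and reliability than other multi-modal and unimodal methods.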