Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, two choices made for training are critical for performance: the target, i.e. the quantity to be estimated, and the objective function, which quantifies the quality of this estimate. This work presents the first experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.
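To make the distinction between a target and an objective function concrete, the following is a minimal sketch and not the system studied in this work. It contrasts a mask target with a log magnitude target on toy data. The specific mask definition (a ratio of clean to noisy magnitude), the mean squared error objective, the additive-magnitude simplification, and all variable and function names are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def mse(estimate, target):
    """Mean squared error, used here as a stand-in objective function."""
    return float(np.mean((estimate - target) ** 2))


# Toy magnitude spectrograms (frequency bins x time frames); values are made up.
rng = np.random.default_rng(0)
clean_mag = rng.uniform(0.1, 1.0, size=(4, 3))
noise_mag = rng.uniform(0.1, 1.0, size=(4, 3))
# Simplifying assumption: magnitudes add. Real mixtures add in the complex domain.
noisy_mag = clean_mag + noise_mag

# Target option 1 (assumed form): a mask that scales the noisy magnitude.
ideal_mask = clean_mag / noisy_mag

# Target option 2: the log magnitude spectrum of the clean speech.
log_target = np.log(clean_mag)


def loss_mask(predicted_mask):
    """Objective when the model output is compared with the mask target."""
    return mse(predicted_mask, ideal_mask)


def loss_log_magnitude(predicted_log_mag):
    """Objective when the model output is compared with the log magnitude target."""
    return mse(predicted_log_mag, log_target)


# With perfect estimates, both targets recover the clean magnitude.
enhanced_from_mask = ideal_mask * noisy_mag
enhanced_from_log = np.exp(log_target)
```

In a trained system the predictions would come from a neural network fed with audio and visual features; here the ideal targets stand in for those predictions only to show how each target maps back to an enhanced magnitude spectrum.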

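When deep learning is used to solve the audio-visual speech enhancement task, the choice of target and objective function is critical for performance. This experimental study examines a range of different targets and objective functions, and the results show that approaches that directly estimate a mask perform best in terms of estimated speech quality and intelligibility.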

Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement