Recurrent neural networks (RNNs), as powerful sequence models, have recently re-emerged as potential acoustic models for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem of standard RNNs, which makes it easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feed-forward neural networks, little is known about why. Here we attempt to answer two questions: a) why do LSTMs work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. We present a visual analysis alongside a series of experiments, which lead to a proposal for a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, and thus reduces generation complexity considerably without degrading quality.
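To make the three gates named above concrete, the following is a minimal NumPy sketch of one step of a standard LSTM cell, together with a parameter count. It is illustrative only: the function names (`init_lstm_params`, `lstm_step`, `count_params`), the layer sizes, and the choice of a plain LSTM without peephole connections are assumptions, not the configuration used in the experiments.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm_params(input_dim, hidden_dim, rng):
    """Create one weight matrix and one bias per block.

    Blocks: input gate (i), forget gate (f), output gate (o), candidate (g).
    Assumption: no peephole connections, small random initial weights.
    """
    params = {}
    for name in ("i", "f", "o", "g"):
        shape = (hidden_dim, input_dim + hidden_dim)
        params["W_" + name] = 0.1 * rng.standard_normal(shape)
        params["b_" + name] = np.zeros(hidden_dim)
    return params


def lstm_step(x, h_prev, c_prev, p):
    """Run one time step of a standard LSTM cell."""
    z = np.concatenate([x, h_prev])          # input and previous hidden state
    i = sigmoid(p["W_i"] @ z + p["b_i"])     # input gate: how much new content to write
    f = sigmoid(p["W_f"] @ z + p["b_f"])     # forget gate: how much old cell state to keep
    o = sigmoid(p["W_o"] @ z + p["b_o"])     # output gate: how much cell state to expose
    g = np.tanh(p["W_g"] @ z + p["b_g"])     # candidate cell content
    c = f * c_prev + i * g                   # updated cell state
    h = o * np.tanh(c)                       # updated hidden state
    return h, c


def count_params(p):
    """Total number of scalar parameters across all blocks."""
    return sum(v.size for v in p.values())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_lstm_params(input_dim=3, hidden_dim=4, rng=rng)
    h, c = lstm_step(np.ones(3), np.zeros(4), np.zeros(4), params)
    print(h.shape, c.shape, count_params(params))
```

In this sketch each of the four blocks contributes `hidden_dim * (input_dim + hidden_dim) + hidden_dim` parameters, so removing a gate removes roughly a quarter of the cell's parameters. Which gates the proposed simplified architecture keeps is not specified in this abstract, so the sketch does not attempt to reproduce it.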

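This study aims to answer two questions: a) why does long short-term memory (LSTM) work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. Through a series of experiments and a visual analysis, we propose a simplified architecture that has fewer parameters than an LSTM and thus greatly reduces generation complexity without degrading quality.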

Investigating gated recurrent neural networks for speech synthesis