Transformer-based text-to-speech (TTS) models (e.g., Transformer
TTS~\cite{li2019neural}, FastSpeech~\cite{ren2019fastspeech}) have shown
advantages in training and inference efficiency over RNN-based models (e.g.,
Tacotron~\cite{shen2018natural}) due to their parallel computation in training
and/or inference. However, the parallel computation increases the d