Deliberation networks are a family of sequence-to-sequence models that have
achieved state-of-the-art performance on a wide range of tasks, such as machine
translation and speech synthesis. A deliberation network consists of multiple
standard sequence-to-sequence models, each one conditioned on the initial input
and the output of the previous model. Training raises several key
questions: whether to apply Monte Carlo approximation to the gradients or to the
loss, whether to train the standard models jointly or separately, whether to
run an intermediate model in teacher forcing or free running mode, and whether
to apply task-specific techniques. Previous work on deliberation networks
typically explores one or two training options for a specific task. This work
introduces a unifying framework, covering various training options, and
addresses the above questions. In general, it is simpler to approximate the
gradients. When parallel training is essential, separate training should be
adopted. Regardless of the task, the intermediate model should be in free
running mode. For tasks where the output is continuous, a guided attention loss
can be used to prevent degradation into a standard model.
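
The abstract does not define the guided attention loss. As a rough illustration
only, the sketch below uses one formulation common in the speech synthesis
literature, which penalises attention mass that lies far from the diagonal. The
function name, the width parameter g, and the assumption that the attention in
question should be roughly monotonic are illustrative choices, not details
taken from this work.

```python
import math


def guided_attention_loss(attention, g=0.2):
    """Mean penalty on off-diagonal attention (assumed formulation).

    attention[n][t] is the weight that output step n places on input step t.
    g is a hypothetical width parameter: smaller values penalise
    off-diagonal weights more sharply.
    """
    num_out = len(attention)
    num_in = len(attention[0])
    total = 0.0
    for n in range(num_out):
        for t in range(num_in):
            # Penalty is 0 on the diagonal and approaches 1 far from it.
            offset = n / num_out - t / num_in
            penalty = 1.0 - math.exp(-(offset ** 2) / (2.0 * g * g))
            total += attention[n][t] * penalty
    return total / (num_out * num_in)


# Toy attention matrices: perfectly diagonal versus uniform.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]

print(guided_attention_loss(diagonal))
print(guided_attention_loss(uniform))
```

Under this formulation a perfectly diagonal attention matrix incurs no penalty,
while a uniform one incurs a positive penalty, so adding the term to the
training loss pushes the attention towards a diagonal alignment.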

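This work explores the training options for the family of deliberation networks and provides a unifying framework. It recommends separate training when parallel training is required, running the intermediate model in free running mode, and, for tasks with continuous output, using a guided attention loss to prevent degradation into a standard model.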