Sequence-to-sequence models are commonly trained via maximum likelihood
estimation (MLE). However, standard MLE training optimizes a word-level
objective, predicting the next word given the preceding ground-truth partial
sentence (i.e., teacher forcing). This procedure focuses on modeling local syntactic patterns.
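The word-level objective described here can be stated explicitly. Writing $\theta$ for the model parameters, $x$ for the source sequence, and $y = (y_1, \ldots, y_T)$ for the target sequence (symbols introduced here for illustration), the standard token-level negative log-likelihood under teacher forcing is
\[
\mathcal{L}_{\text{MLE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, x\right),
\]
where $y_{<t}$ denotes the ground-truth prefix, so each prediction conditions on reference words rather than on the model's own previous outputs.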