Deep autoregressive sequence-to-sequence models have demonstrated impressive
performance across a wide variety of tasks in recent years. While common
architecture classes such as recurrent, convolutional, and self-attention
networks make different trade-offs between the amount of compu