Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli
TL;DR: This paper introduces a Transformer model that can make output predictions at different stages of the network and varies the layers applied at each step to adjust both the amount of computation and the model capacity. In experiments on IWSLT German-English translation, our approach matches the accuracy of a well-tuned baseline Transformer while using less than a quarter of the decoder layers.
Abstract
State of the art sequence-to-sequence models for large scale tasks perform a
fixed number of computations for each input sequence regardless of whether it
is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network, adjusting the layers applied at each step to tune both the amount of computation and the model capacity.
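The core idea, reading out predictions from intermediate decoder layers and halting early when the model is confident, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name, the confidence-based exit criterion, and the exit_threshold value are assumptions made for the sketch, and the paper studies several halting mechanisms beyond this one.

```python
# Minimal sketch of depth-adaptive decoding (illustrative, not the paper's code):
# each decoder layer has its own output classifier, so a prediction can be
# read out at any depth; generation stops early once the classifier at the
# current depth is confident enough.
import torch
import torch.nn as nn

class DepthAdaptiveDecoder(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=6, vocab=1000,
                 exit_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One classifier per layer: unlike Universal Transformers, which
        # iterate the same layer, different layers (and classifiers) are
        # applied at every step, so capacity varies with depth as well.
        self.classifiers = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in range(n_layers)
        )
        self.exit_threshold = exit_threshold

    def forward(self, tgt, memory):
        h = tgt
        for layer, clf in zip(self.layers, self.classifiers):
            h = layer(h, memory)
            logits = clf(h)
            # Confidence-based halting (an assumed criterion): exit once the
            # most likely token at the last position is probable enough.
            conf = logits[:, -1].softmax(-1).max(-1).values
            if bool((conf > self.exit_threshold).all()):
                break
        return logits

# Usage: easy inputs may exit after one or two layers; hard ones use all six.
decoder = DepthAdaptiveDecoder()
memory = torch.randn(2, 10, 64)   # encoder output (batch, src_len, d_model)
tgt = torch.randn(2, 5, 64)       # embedded target prefix
print(decoder(tgt, memory).shape) # (2, 5, 1000)
```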