Applying the Transformer architecture on the character level usually requires very deep architectures that are difficult and slow to train. A few approaches have been proposed that partially overcome this problem by using explicit segmentation into tokens. We show that by initially training a subword model based on this segmentation and then finetuning it on characters, we can obtain a neural machine translation model that works at the character level without requiring segmentation. Without changing the vanilla 6-layer Transformer Base architecture, we train purely character-level models. Our character-level models better capture morphological phenomena and show much higher robustness towards source-side noise at the expense of somewhat worse overall translation quality. Our study is a significant step towards high-performance character-based models that are not extremely large.
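The relationship between subword and character segmentation that the finetuning step relies on can be pictured with a toy example. The sketch below is only an illustration under stated assumptions, not the authors' preprocessing: the merge table `TOY_MERGES`, the helper names `apply_merges` and `segment`, and the `</w>` end-of-word marker are hypothetical choices made for this example. It shows that a BPE-style segmenter with an empty merge table degenerates to pure character-level segmentation, so the character-level input is simply a longer sequence over a smaller symbol inventory.

```python
# Toy BPE-style segmenter (illustrative assumption, not the paper's tooling).

def apply_merges(word, merges):
    """Apply BPE-style merges, in priority order, to a single word."""
    symbols = list(word) + ["</w>"]  # hypothetical end-of-word marker
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into one symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def segment(sentence, merges):
    """Segment a whitespace-tokenized sentence with the given merge table."""
    return [sym for word in sentence.split() for sym in apply_merges(word, merges)]


# Made-up merge table for demonstration only.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]

subword_view = segment("lower low", TOY_MERGES)  # subword segmentation
char_view = segment("lower low", [])             # no merges: character level

print(subword_view)
print(char_view)
```

With no merges the same sentence becomes a noticeably longer symbol sequence, which is one reason character-level training is typically slower than subword training.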

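Summary: Applying the Transformer architecture at the character level usually requires very deep architectures that are hard to train. This paper proposes first training the model with subword segmentation and then finetuning it at the character level, which yields a neural machine translation model that needs no segmentation. It shows that this approach captures morphological phenomena better and is more robust, at a relatively small training cost.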

Reasonably Sized Character-Level Transformer NMT by Finetuning Subword Systems