Dec, 2022
MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling
Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot
TL;DR
This paper introduces MANTa, a module for adaptive neural tokenization. MANTa strikes a balance between the performance of byte-level models and the speed of subword-based models, and by explicitly segmenting input sequences it improves the language model's robustness.
Abstract
Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness.
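
The TL;DR above describes MANTa as a module that explicitly segments byte sequences in a differentiable way, so the tokenizer can be trained end-to-end with the language model. Below is a minimal, hypothetical PyTorch sketch of that general idea: predict a segmentation-frontier probability for each byte, then softly pool byte embeddings into a shorter sequence of block embeddings. The `SoftTokenizer` name, the Gaussian assignment kernel, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of gradient-based tokenization; not the paper's exact design.
import torch
import torch.nn as nn


class SoftTokenizer(nn.Module):
    """Maps a byte sequence to a shorter sequence of soft 'block' embeddings.

    Following the paper's high-level description: predict a segmentation-frontier
    probability for each byte, then pool byte embeddings into blocks weighted by
    how likely each byte belongs to each block. Every step is differentiable, so
    the tokenizer can be trained end-to-end with the language model.
    """

    def __init__(self, vocab_size=256, dim=64, max_blocks=32):
        super().__init__()
        self.byte_emb = nn.Embedding(vocab_size, dim)
        self.frontier = nn.Linear(dim, 1)  # per-byte frontier logit
        self.max_blocks = max_blocks

    def forward(self, byte_ids):  # byte_ids: (batch, seq_len)
        x = self.byte_emb(byte_ids)                       # (B, L, D)
        p = torch.sigmoid(self.frontier(x)).squeeze(-1)   # (B, L) frontier probs
        # Expected block index of each byte = cumulative frontier mass so far.
        pos = torch.cumsum(p, dim=-1)                     # (B, L), non-decreasing
        # Soft assignment of byte i to block k via a Gaussian kernel around the
        # expected position (an illustrative choice of smoothing).
        blocks = torch.arange(self.max_blocks, device=byte_ids.device)
        w = torch.exp(-(pos.unsqueeze(-1) - blocks) ** 2)    # (B, L, K)
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)   # normalize over bytes
        # Pool byte embeddings into block embeddings: (B, K, D).
        return torch.einsum("bld,blk->bkd", x, w)


if __name__ == "__main__":
    tok = SoftTokenizer()
    bytes_in = torch.randint(0, 256, (2, 100))  # e.g. UTF-8 bytes of raw text
    print(tok(bytes_in).shape)  # torch.Size([2, 32, 64]): shorter sequence for the LM
```

Because the frontier probabilities and the pooling are differentiable, gradients from the language-modeling loss can flow back into the segmentation predictor, which is what allows the tokenization to adapt during training rather than being fixed in advance.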