Learned representations at the level of characters, sub-words, words, and
sentences have each contributed to advances in understanding different NLP
tasks and linguistic phenomena. However, learning textual embeddings is costly
because they are tokenization-specific and a separate model must be trained
for each level of abstraction. We introduce a novel language representation
model that learns to compress to different levels of abstraction at
different layers of the same model. We apply Nonparametric Variational
Information Bottleneck (NVIB) to stacked Transformer self-attention layers in
the encoder, which encourages an information-theoretic compression of the
representations through the model. We find that the layers within the model
correspond to increasing levels of abstraction and that their representations
are more linguistically informed. Finally, we show that NVIB compression
results in a model which is more robust to adversarial perturbations.

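This paper introduces a language representation model that learns to compress to different levels of abstraction at different layers of the same model, applying Nonparametric Variational Information Bottleneck (NVIB) to the stacked Transformer self-attention layers in the encoder to encourage information-theoretic compression of the representations. The paper finds that the layers within the model correspond to increasing levels of abstraction and that their representations are more linguistically informed. Finally, experiments show that NVIB compression yields a model that is more robust to adversarial perturbations.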