Conventional representation learning algorithms for knowledge graphs (KG) map
each entity to a unique embedding vector. Such a shallow lookup results in a
linear growth of memory consumption for storing the embedding matrix and incurs
high computational costs when working with real-world KGs. Drawing parallels
with subword tokenization commonly used in NLP, we explore the landscape of
more parameter-efficient node embedding strategies with possibly sublinear
memory requirements. To this end, we propose NodePiece, an anchor-based
approach to learn a fixed-size entity vocabulary. In NodePiece, a vocabulary of
subword/sub-entity units is constructed from anchor nodes in a graph with known
relation types. Given such a fixed-size vocabulary, it is possible to bootstrap
an encoding and embedding for any entity, including those unseen during
training. Experiments show that NodePiece performs competitively in node
classification, link prediction, and relation prediction tasks while retaining
less than 10% of explicit nodes in a graph as anchors and often having 10x
fewer parameters. In particular, we show that a NodePiece-enabled model
outperforms existing shallow models on the large OGB WikiKG 2 graph while
having 70x fewer parameters.
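
The anchor-based vocabulary idea can be illustrated with a small sketch. The abstract does not spell out the tokenization details, so the following assumes that each node is described by its k nearest anchors (with hop distances) plus the relation types on its outgoing edges. The toy triples, the anchor choice, and all function names are hypothetical and are not taken from the NodePiece code base.

```python
from collections import deque


def build_adjacency(triples):
    """Undirected adjacency sets built from (head, relation, tail) triples."""
    adj = {}
    for head, _, tail in triples:
        adj.setdefault(head, set()).add(tail)
        adj.setdefault(tail, set()).add(head)
    return adj


def outgoing_relations(triples):
    """Map each head entity to the set of relation types it emits."""
    rels = {}
    for head, rel, _ in triples:
        rels.setdefault(head, set()).add(rel)
    return rels


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def tokenize_node(node, anchor_dists, relations, k=2):
    """Describe a node by its k nearest anchors and its outgoing relation types.

    Assumption: this particular token layout is illustrative only.
    """
    reachable = [
        (dist[node], anchor)
        for anchor, dist in anchor_dists.items()
        if node in dist
    ]
    nearest = sorted(reachable)[:k]
    return {
        "anchors": [(anchor, d) for d, anchor in nearest],
        "relations": sorted(relations.get(node, ())),
    }


# Hypothetical toy graph: a chain a - b - c - d - e with two anchors.
triples = [("a", "r1", "b"), ("b", "r2", "c"), ("c", "r1", "d"), ("d", "r3", "e")]
anchors = ["a", "e"]

adj = build_adjacency(triples)
anchor_dists = {anchor: bfs_distances(adj, anchor) for anchor in anchors}
relations = outgoing_relations(triples)

token_b = tokenize_node("b", anchor_dists, relations)
token_e = tokenize_node("e", anchor_dists, relations)
```

In this sketch only the anchors and relation types would need learned embeddings, so the vocabulary size is fixed by the number of anchors and relations rather than by the number of entities. A node that appears after training can be tokenized the same way as long as it connects to the graph.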

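Drawing on subword tokenization techniques common in NLP, this paper explores more parameter-efficient node embedding strategies and proposes NodePiece, an anchor-based method that builds a fixed-size vocabulary of sub-entity units. The method performs competitively on node classification, link prediction, and relation prediction tasks while using fewer parameters.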

NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs

Contextual word representations have become a standard in modern natural language
processing systems. These models use subword tokenization to handle large
vocabularies and unknown words. Word-level usage of such systems requires a way
of pooling multiple subwords that correspond to a single word. In this paper we
investigate how the choice of subword pooling affects the downstream
performance on three tasks: morphological probing, POS tagging and NER, in 9
typologically diverse languages. We compare these in two massively multilingual
models, mBERT and XLM-RoBERTa. For morphological tasks, the widely used 'choose
the first subword' strategy is the worst, and the best results are obtained by
using attention over the subwords. For POS tagging, both of these strategies
perform poorly and the best choice is to use a small LSTM over the subwords.
The same strategy works best for NER and we show that mBERT is better than
XLM-RoBERTa in all 9 languages. We publicly release all code, data and the full
result tables at https://github.com/juditacs/subword-choice.
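
Two of the pooling strategies named above can be sketched in plain Python. This is a minimal illustration, assuming subword vectors are given as lists of floats and that attention pooling uses a single query vector with dot-product scores and a softmax. The query vector, the toy inputs, and the function names are hypothetical, and the LSTM variant is left out because it needs a deep learning library.

```python
import math


def pool_first(subword_vecs):
    """'Choose the first subword': the word vector is the first subword's vector."""
    return list(subword_vecs[0])


def pool_attention(subword_vecs, query):
    """Softmax-weighted sum of subword vectors, scored by dot product with `query`.

    Assumption: in a real model the query would be a learned parameter.
    """
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in subword_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * x for w, x in zip(weights, column))
        for column in zip(*subword_vecs)
    ]


# Hypothetical vectors for a word split into three subwords.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

first = pool_first(vecs)
uniform = pool_attention(vecs, [0.0, 0.0])   # zero query gives equal weights
focused = pool_attention(vecs, [10.0, 0.0])  # favors subwords with a large first component
```

With a zero query the attention weights are uniform, so the result equals the mean of the subword vectors. A trained query lets the model decide which subwords matter for the task.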

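Examines how the choice of subword pooling affects two massively multilingual models on three tasks (morphological probing, POS tagging, and named entity recognition) and finds that pooling the subwords with a small LSTM works best for POS tagging and NER.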

Subword Pooling Makes a Difference

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword
tokenization process of language models as it provides multiple benefits.
However, this process is solely based on pre-training data statistics, making
it hard for the tokenizer to handle infrequent spellings. On the other hand,
though robust to misspellings, pure character-level models often lead to
unreasonably long sequences and make it harder for the model to learn
meaningful words. To alleviate these challenges, we propose a character-based
subword module (char2subword) that learns the subword embedding table in
pre-trained models like BERT. Our char2subword module builds representations
from characters out of the subword vocabulary, and it can be used as a drop-in
replacement of the subword embedding table. The module is robust to
character-level alterations such as misspellings, word inflection, casing, and
punctuation. We integrate it further with BERT through pre-training while
keeping the BERT transformer parameters fixed, which makes the method
practical. Finally, we show that incorporating our module into mBERT significantly
improves the performance on the social media linguistic code-switching
evaluation (LinCE) benchmark.
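
The sensitivity of BPE to infrequent spellings can be seen in a small sketch of how learned merges are applied to a word. It assumes the standard procedure of starting from characters and repeatedly applying the highest-priority learned merge. The merge list below is hypothetical and hand-written for illustration, and the code does not implement the char2subword module itself.

```python
def apply_bpe(word, merges):
    """Segment `word` by applying learned merges in priority order."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair that was learned earliest.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, as if learned from data where "there" is frequent.
merges = [("t", "h"), ("e", "r"), ("th", "er"), ("ther", "e")]

clean = apply_bpe("there", merges)
typo = apply_bpe("tehre", merges)
```

Here the correctly spelled word collapses into a single subword, while the transposed spelling stays as five single characters because none of the learned merges match it.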

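Proposes a character-based subword module (char2subword) that learns the subword embedding table of pre-trained models such as BERT and is further integrated with BERT through pre-training, significantly improving performance on the social media linguistic code-switching evaluation (LinCE) benchmark.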

Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality

The success of pretrained transformer language models (LMs) in natural
language processing has led to a wide range of pretraining setups. In
particular, these models employ a variety of subword tokenization methods, most
notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the
WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling
(Kudo, 2018), to segment text. However, to the best of our knowledge, the
literature does not contain a direct evaluation of the impact of tokenization
on language model pretraining. We analyze differences between BPE and unigram
LM tokenization, finding that the latter method recovers subword units that
align more closely with morphology and avoids problems stemming from BPE's
greedy construction procedure. We then compare the fine-tuned task performance
of identical transformer masked language models pretrained with these
tokenizations. Across downstream tasks and two languages (English and
Japanese), we find that the unigram LM tokenization method matches or
outperforms BPE. We hope that developers of future pretrained LMs will consider
adopting the unigram LM method over the more prevalent BPE.
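
The greedy construction procedure of BPE mentioned above can be sketched as follows. This is a minimal version, assuming the usual formulation in which the most frequent adjacent symbol pair is merged at every step. The word frequencies are a hypothetical toy corpus, and the unigram LM method is not shown.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Greedily learn up to `num_merges` merges from a word-frequency table."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Greedy step: commit to the currently most frequent pair.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = merge_pair(symbols, best)
            new_corpus[merged] = new_corpus.get(merged, 0) + freq
        corpus = new_corpus
    return merges


# Hypothetical toy corpus of word frequencies.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
learned = learn_bpe(word_freqs, 3)
```

Each merge is chosen only from the current pair counts and is never revisited, which is the greedy behaviour the abstract contrasts with the unigram LM approach.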

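Analyzes how different subword tokenization methods, namely BPE and unigram LM, affect the pretraining of Transformer language models, and compares the fine-tuned task performance of the resulting models. The unigram method matches or outperforms BPE, and the authors recommend that developers adopt it for pretraining.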