Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as "innoventor", are one particularly challenging class of OOV terms, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable ways and to unpredictable degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify how difficult it is for large-scale contextual language models such as BERT to interpret the meanings of blends. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find that this is mostly due to the loss of characters resulting from blend formation. We then assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.

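In this paper, we quantify the difficulty of interpreting the meanings of blends through experiments on a new dataset. The results show that BERT's processing of these blends does not fully access the meanings of their components, leaving their contextual representations semantically impoverished. Context-aware embedding systems perform comparatively well at recognizing the structure of blends and recovering their origins, but their results are still far from satisfactory.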

Can It Blend?