We study the settings in which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe) and to an even simpler baseline, random word embeddings, focusing on the impact of the training set size and the linguistic properties of the task. Surprisingly, we find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) of them on benchmark tasks. Furthermore, we identify properties of data for which contextual embeddings give particularly large gains: language containing complex structure, ambiguous word usage, and words unseen in training.


Contextual Embeddings: When Are They Worth It?