Recent research has shown that static word embeddings can encode word frequency information. However, this phenomenon and its effects on downstream tasks remain understudied. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled. This shows that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment showing that manipulating word frequencies can make biases change sign or reverse their order.

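This paper systematically studies the association between word frequency and semantic similarity in several static word embeddings, and finds that similarity is higher between high-frequency words. It also examines the effect of word frequency on embedding-based measurements of gender bias, and shows that manipulating word frequencies can reverse the measured bias.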

Frequency dependence of similarity computation in word embeddings
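
The comparison of similarity across frequency combinations described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `mean_similarity_by_frequency_bin`, the use of cosine similarity, and the equal-sized rank-based bins are all assumptions made here for illustration, and the embedding vectors are assumed to be non-zero.

```python
import numpy as np


def mean_similarity_by_frequency_bin(vectors, counts, n_bins=3):
    """Return an (n_bins, n_bins) matrix of mean cosine similarities.

    Entry [i, j] averages the cosine similarity over all pairs of distinct
    words with one word in frequency bin i and the other in bin j
    (bin 0 holds the least frequent words). Hypothetical helper, not taken
    from the paper.
    """
    vectors = np.asarray(vectors, dtype=float)
    counts = np.asarray(counts)

    # Normalise rows so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T

    # Assign each word to a bin according to the rank of its corpus count.
    ranks = np.argsort(np.argsort(counts))
    bins = ranks * n_bins // len(counts)

    result = np.full((n_bins, n_bins), np.nan)
    for i in range(n_bins):
        rows = np.where(bins == i)[0]
        for j in range(n_bins):
            cols = np.where(bins == j)[0]
            block = sims[np.ix_(rows, cols)]
            if i == j:
                # Drop the diagonal: a word's similarity with itself is 1.
                block = block[~np.eye(len(rows), dtype=bool)]
            if block.size:
                result[i, j] = block.mean()
    return result
```

Under these assumptions, a frequency effect of the kind reported would appear as a larger value in the high-frequency corner of the matrix than in the other cells. Running the same function on embeddings trained on shuffled text would then indicate whether the pattern persists without real semantic associations.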