In NLP, text language models based on words or subwords are known to
outperform their character-based counterparts. Yet, in the speech community,
the standard inputs of spoken LMs are 20 ms or 40 ms discrete units (shorter
than a phoneme). Taking inspiration from word-based LMs, we introduce a
Generative Spoken Language Model (GSLM) based on word-size continuous-valued
audio embeddings that can generate diverse and expressive language output. This
is obtained by replacing the lookup table for lexical types with a Lexical
Embedding function, the cross-entropy loss with a contrastive loss, and
multinomial sampling with k-NN sampling. The resulting model is the first
generative language model based on word-size continuous embeddings. Its
performance is on par with discrete-unit GSLMs in generation quality, as
measured by automatic metrics and subjective human judgements. Moreover, it is
five times more memory efficient thanks to its large 200 ms units. In addition,
the embeddings before and after the Lexical Embedder are phonetically and
semantically interpretable.

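Summary: A Generative Spoken Language Model (GSLM) based on continuous-valued audio embeddings achieves diverse and expressive language generation by introducing a word-size continuous embedding function, a contrastive loss, and k-NN sampling. The model is on par with discrete-unit GSLMs in generation quality while being five times more memory efficient. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.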
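
To illustrate how k-NN sampling can stand in for multinomial sampling when the model outputs a continuous vector rather than a distribution over a vocabulary, here is a minimal sketch. It is not the authors' implementation. The function name `knn_sample`, the `bank` of stored word-size embeddings, the use of cosine similarity, and the softmax weighting with a `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np


def knn_sample(query, bank, k=5, temperature=1.0, rng=None):
    """Pick one of the k bank embeddings nearest to `query` (hypothetical helper).

    query: (d,) vector predicted by the language model.
    bank:  (n, d) matrix of stored word-size embeddings.
    Returns the row index of the sampled bank entry.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Cosine similarity between the query and every bank entry (assumed metric).
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q

    # Indices of the k most similar entries (order within the k is arbitrary).
    top = np.argpartition(-sims, k - 1)[:k]

    # Softmax over the k similarities; lower temperature is closer to greedy.
    logits = sims[top] / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    return int(rng.choice(top, p=probs))
```

With `k=1` this reduces to nearest-neighbour decoding. Larger `k` or a higher `temperature` trades fidelity to the predicted vector for more diverse output.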