text word embeddings that encode distributional semantic features work by
modeling contextual similarities of frequently occurring words. Acoustic word
embeddings, on the other hand, typically encode low-level phonetic
similarities. Semantic embeddings for spoken words have been previo