Producing large amounts of annotated speech data for training ASR systems
remains difficult for the more than 95% of the world's languages that are
low-resourced. However, human babies start to learn a language from the
sounds of a small number of exemplar words, without hearing a large amount of
data. This paper presents preliminary work in this direction. Audio
Word2Vec is used to obtain embeddings of spoken words which carry phonetic
information extracted from the signals. An autoencoder is used to generate
embeddings of text words based on the articulatory features for the phoneme
sequences. Both sets of embeddings for spoken and text words describe similar
phonetic structures among words in their respective latent spaces. A mapping
from the audio embeddings to the text embeddings therefore amounts to
word-level ASR. This mapping can be learned by aligning a small number of
spoken words with the corresponding text words in the embedding spaces. In the
initial experiments, only 200 annotated spoken words and one hour of
unannotated speech data gave a word accuracy of 27.5%, which is low but a good
starting point.
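
The alignment step described above can be illustrated with a minimal sketch. It assumes a purely linear map fitted by least squares on a handful of paired embeddings, followed by a nearest-neighbour lookup in the text space. The abstract does not state the form of the mapping, so the linear choice, the function names `fit_linear_map` and `recognize`, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np

def fit_linear_map(audio_emb, text_emb):
    """Least-squares W such that audio_emb @ W approximates text_emb."""
    W, *_ = np.linalg.lstsq(audio_emb, text_emb, rcond=None)
    return W

def recognize(audio_vec, W, text_emb, vocab):
    """Map one audio embedding into the text space and return the nearest word."""
    mapped = audio_vec @ W
    # Cosine similarity against every text embedding in the vocabulary.
    sims = (text_emb @ mapped) / (
        np.linalg.norm(text_emb, axis=1) * np.linalg.norm(mapped) + 1e-12
    )
    return vocab[int(np.argmax(sims))]

# Assumption: synthetic stand-in data in which the audio space is an exact
# linear distortion of the text space. Real embeddings would be noisier.
rng = np.random.default_rng(0)
vocab = ["brother", "bother", "sister", "mother", "father", "other"]
text_emb = rng.normal(size=(len(vocab), 4))
true_map = rng.normal(size=(4, 4))
audio_emb = text_emb @ np.linalg.inv(true_map)

# Fit on five annotated pairs, then recognize the held-out sixth spoken word.
W = fit_linear_map(audio_emb[:5], text_emb[:5])
print(recognize(audio_emb[5], W, text_emb, vocab))
```

With real embeddings the two spaces would match only approximately, so a held-out word would not always be recovered; the sketch only shows how a few paired words can anchor the mapping.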

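Cross-modal speech recognition using Audio Word2Vec and an autoencoder, demonstrating that even when training data is scarce, an ASR system can be trained by aligning the embeddings of a small amount of audio and text.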

Almost-unsupervised, low-resource speech recognition based on phonetic structures learned from speech and text data

Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data

Word embedding or Word2Vec has been successful in offering semantics for text
words learned from the context of words. Audio Word2Vec was shown to offer
phonetic structures for spoken words (signal segments for words) learned from
signals within spoken words. This paper proposes a two-stage framework to
perform phonetic-and-semantic embedding on spoken words considering the context
of the spoken words. Stage 1 performs phonetic embedding with speaker
characteristics disentangled. Stage 2 then performs semantic embedding in
addition. We further propose to evaluate the phonetic-and-semantic nature of
the audio embeddings obtained in Stage 2 by parallelizing with text embeddings.
In general, phonetic structure and semantics inevitably disturb each other. For
example, the words "brother" and "sister" are close in semantics but very
different in phonetic structure, while the words "brother" and "bother" are
the other way around. Phonetic-and-semantic embedding is nevertheless
attractive, as shown in the initial experiments on spoken document retrieval.
Not only can spoken documents that include the spoken query be retrieved based
on the phonetic structures, but spoken documents that are semantically related
to the query without including it can also be retrieved based on the semantics.
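
The retrieval idea can be sketched as follows, assuming each spoken document is represented by the embeddings of its spoken words and is scored by its best cosine match to the query embedding. The scoring rule, the names `cosine` and `rank_documents`, and the hand-made three-dimensional vectors are hypothetical; the abstract does not describe the actual retrieval procedure.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def rank_documents(query_vec, documents):
    """Score each document by its best-matching spoken word, best first."""
    scores = {
        name: max(cosine(query_vec, word_vec) for word_vec in word_vecs)
        for name, word_vecs in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Assumption: toy embeddings chosen by hand, not produced by any model.
query = np.array([1.0, 0.0, 0.0])
documents = {
    # Contains nothing close to the query.
    "doc_c": [np.array([0.0, 0.0, 1.0])],
    # Contains a word that is near the query but not identical to it.
    "doc_b": [np.array([0.8, 0.6, 0.0])],
    # Contains the query word itself plus an unrelated word.
    "doc_a": [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])],
}
ranking = rank_documents(query, documents)
print(ranking)
```

In this toy setup the document containing the query ranks first and the document with a nearby word ranks second, mirroring the two kinds of hits described above.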

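This paper presents a two-stage framework for phonetic-and-semantic embedding of spoken words that takes the context of the spoken words into account. Stage 1 performs phonetic embedding and Stage 2 performs semantic embedding. We further propose evaluating the phonetic-and-semantic nature of the audio embeddings obtained in Stage 2 by paralleling them with text embeddings.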

Phonetic-and-semantic embedding of spoken words and its application to spoken content retrieval

Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

The vector representations of fixed dimensionality for words (in text)
offered by Word2Vec have been shown to be very useful in many application
scenarios, in particular due to the semantic information they carry. This paper
proposes a parallel version, the Audio Word2Vec. It offers the vector
representations of fixed dimensionality for variable-length audio segments.
These vector representations are shown to describe the sequential phonetic
structures of the audio segments to a good degree, with very attractive
real-world applications such as query-by-example Spoken Term Detection (STD). In
this STD application, the proposed approach significantly outperformed the
conventional Dynamic Time Warping (DTW) based approaches at significantly lower
computation requirements. We propose unsupervised learning of Audio Word2Vec
from audio data without human annotation using a Sequence-to-sequence
Autoencoder (SA). The SA consists of two RNNs equipped with Long Short-Term
Memory (LSTM)
units: the first RNN (encoder) maps the input audio sequence into a vector
representation of fixed dimensionality, and the second RNN (decoder) maps the
representation back to the input audio sequence. The two RNNs are jointly
trained by minimizing the reconstruction error. A Denoising Sequence-to-sequence
Autoencoder (DSA) is further proposed, offering more robust learning.
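
The encoder-decoder structure can be sketched as an untrained forward pass. This is a simplification under stated assumptions: plain tanh recurrent cells stand in for the LSTM units, the reconstruction error is taken to be mean squared error, the weights are random, and the 13-dimensional frames and 8-dimensional code are arbitrary sizes. The class name `TinySeq2SeqAutoencoder` is hypothetical.

```python
import numpy as np

class TinySeq2SeqAutoencoder:
    """Forward pass only; tanh RNN cells are an assumed stand-in for LSTM units."""

    def __init__(self, feat_dim, code_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.W_in = rng.normal(scale=s, size=(feat_dim, code_dim))   # encoder input
        self.W_enc = rng.normal(scale=s, size=(code_dim, code_dim))  # encoder recurrence
        self.W_dec = rng.normal(scale=s, size=(code_dim, code_dim))  # decoder recurrence
        self.W_out = rng.normal(scale=s, size=(code_dim, feat_dim))  # decoder output

    def encode(self, frames):
        """Map a (T, feat_dim) sequence to a single vector of size code_dim."""
        h = np.zeros(self.W_enc.shape[0])
        for x in frames:
            h = np.tanh(x @ self.W_in + h @ self.W_enc)
        return h

    def decode(self, code, num_frames):
        """Unroll from the code vector to a (num_frames, feat_dim) sequence."""
        h, outputs = code, []
        for _ in range(num_frames):
            h = np.tanh(h @ self.W_dec)
            outputs.append(h @ self.W_out)
        return np.stack(outputs)

    def reconstruction_error(self, frames):
        """Mean squared error between the input and its reconstruction."""
        recon = self.decode(self.encode(frames), len(frames))
        return float(np.mean((frames - recon) ** 2))

# Assumption: random frames stand in for acoustic features of two segments.
model = TinySeq2SeqAutoencoder(feat_dim=13, code_dim=8)
rng = np.random.default_rng(1)
short_seg = rng.normal(size=(20, 13))
long_seg = rng.normal(size=(55, 13))

# Segments of different lengths map to vectors of the same dimensionality.
print(model.encode(short_seg).shape, model.encode(long_seg).shape)
print(model.reconstruction_error(short_seg))
```

Training would adjust both sets of weights jointly to reduce the reconstruction error. A denoising variant would encode a corrupted copy of the input while still reconstructing the clean one, though the abstract does not specify the kind of noise used.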

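This paper proposes Audio Word2Vec, a parallel version of Word2Vec that provides fixed-dimensional vector representations for variable-length audio segments. It is learned without supervision from speech data that has no human annotation, and a Denoising Sequence-to-sequence Autoencoder is used for more robust learning.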