Producing high-quality labeled data for specific domains is costly and
time-consuming for human annotators. In this work, we propose a self-supervised
domain adaptation method based on an iterative pseudo-forced alignment
algorithm. The produced alignments are used to adapt an end-to-end Automatic
Speech Recognition (ASR) system and are iteratively refined. The algorithm is
fed with frame-wise character posteriors produced by a seed ASR model, trained
on out-of-domain data and optimized with a Connectionist Temporal
Classification (CTC) loss. The
alignments are computed iteratively over a corpus of broadcast TV. The process
is repeated, either reducing the quantity of text to be aligned or expanding the
alignment window, until the best possible audio-text alignment is found. The
starting timestamps, or temporal anchors, are derived solely from the
confidence score of the last aligned utterance. This score is computed from the
paths of the CTC-alignment matrix. With this methodology, no human-revised text
references are required. Alignments from long audio files with low-quality
transcriptions, like TV captions, are filtered by confidence score and are then
ready for further ASR adaptation. The obtained results, on both the Spanish
RTVE2022 and CommonVoice databases, support the feasibility of using CTC-based
systems to perform highly accurate audio-text alignment, domain adaptation,
and semi-supervised training of end-to-end ASR.
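
The alignment and filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it runs a standard Viterbi pass over the CTC trellis to align one utterance to frame-wise log posteriors. It assumes the confidence score is the geometric mean of the per-frame probabilities along the best path, since the abstract only states that the score comes from the paths of the CTC-alignment matrix. The function names, the `blank=0` convention, and the `threshold` default are hypothetical.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, targets, blank=0):
    """Viterbi-align `targets` to frame-wise log posteriors.

    log_probs: list of T frames, each a list of per-symbol log probabilities.
    targets:   non-empty list of symbol indices (no blanks).
    Returns (frame_labels, confidence). The confidence is an ASSUMED score:
    the geometric mean of the per-frame probabilities on the best path.
    """
    if not log_probs or not targets:
        raise ValueError("need at least one frame and one target symbol")

    # Extended label sequence used by CTC: blank, c1, blank, c2, ..., blank.
    ext = [blank]
    for c in targets:
        ext += [c, blank]
    T, S = len(log_probs), len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance by one, or skip a blank between
            # two different non-blank labels.
            best_prev, best = s, score[t - 1][s]
            if s >= 1 and score[t - 1][s - 1] > best:
                best_prev, best = s - 1, score[t - 1][s - 1]
            if (s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
                    and score[t - 1][s - 2] > best):
                best_prev, best = s - 2, score[t - 1][s - 2]
            if best > NEG_INF:
                score[t][s] = best + log_probs[t][ext[s]]
                back[t][s] = best_prev

    # A valid path ends on the last label or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    total = score[T - 1][s]
    if total == NEG_INF:
        raise ValueError("too few frames to align the target sequence")

    # Backtrace to recover one label per frame.
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    path.reverse()
    return path, math.exp(total / T)


def filter_by_confidence(aligned, threshold=0.5):
    """Keep (utterance_id, confidence) pairs at or above a hypothetical threshold."""
    return [(uid, conf) for uid, conf in aligned if conf >= threshold]
```

In an iterative setting, a low confidence for an utterance would trigger another attempt with less text or a wider window; that outer loop is omitted here because the abstract does not specify its details.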

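This paper proposes a self-supervised domain adaptation method that uses the alignments produced by an iterative pseudo-forced alignment algorithm to customize an end-to-end automatic speech recognition system, recomputing the alignments iteratively by reducing the amount of text or expanding the alignment window. Using frame-wise character posteriors and a CTC loss, the method achieves highly accurate audio-text alignment, domain adaptation, and semi-supervised training on the Spanish RTVE2022 and CommonVoice databases.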

Iterative pseudo-forced alignment using an acoustic CTC loss for self-supervised ASR domain adaptation

Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

Machines that can represent and describe environmental soundscapes have
practical potential, e.g., for audio tagging and captioning systems. Prevailing
learning paradigms rely on parallel audio-text data, which is,
however, scarce on the web. We propose VIP-ANT, which induces
\textbf{A}udio-\textbf{T}ext alignment without using any parallel audio-text
data. Our key idea is to share the image modality between bi-modal image-text
representations and bi-modal image-audio representations; the image modality
functions as a pivot and connects audio and text in a tri-modal embedding space
implicitly.
In a difficult zero-shot setting with no paired audio-text data, our model
demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio
classification tasks, and even surpasses the supervised state of the art for
Clotho caption retrieval (with audio queries) by 2.2\% R@1. We further
investigate cases of minimal audio-text supervision, finding that, e.g., just a
few hundred supervised audio-text pairs increase the zero-shot audio
classification accuracy by 8\% on US8K. However, to reach human parity on some
zero-shot tasks, our empirical scaling experiments suggest that we would need
about $2^{21} \approx 2M$ supervised audio-caption pairs. Our work opens up new
avenues for learning audio-text connections with little to no parallel
audio-text data.
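
Once audio and text live in the same embedding space, zero-shot classification and retrieval reduce to nearest-neighbor search, which the sketch below illustrates. It is not VIP-ANT itself: the embeddings are assumed to be already computed by some audio and text encoders, the vectors in the usage note are toy values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_emb, class_text_embs):
    """Return the class label whose text embedding is closest to the audio embedding.

    class_text_embs: dict mapping a label (e.g. a class-name prompt) to its
    text embedding in the shared space.
    """
    return max(class_text_embs,
               key=lambda label: cosine(audio_emb, class_text_embs[label]))


def recall_at_1(query_embs, candidate_embs, gold):
    """Fraction of queries whose top-ranked candidate is the gold index."""
    hits = 0
    for i, query in enumerate(query_embs):
        best = max(range(len(candidate_embs)),
                   key=lambda j: cosine(query, candidate_embs[j]))
        hits += int(best == gold[i])
    return hits / len(query_embs)
```

For example, with toy class embeddings `{"dog bark": [1, 0, 0], "siren": [0, 1, 0]}` and an audio embedding `[0.9, 0.1, 0.0]`, `zero_shot_classify` picks the label with the highest cosine similarity; no audio-text pair is needed at this stage, only the shared space.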

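This work proposes VIP-ANT, a model that induces audio-text alignment without any parallel audio-text data. It performs well on zero-shot audio classification and caption retrieval, even surpassing supervised models on the latter. The work also finds that a small amount of supervised data improves performance, but that reaching human-level performance would still require far more data.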

Connecting audio and text without parallel data through visual knowledge transfer

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Automatic speech recognition (ASR) has been widely researched with supervised
approaches, but many low-resource languages lack audio-text aligned data,
so supervised methods cannot be applied to them.
In this work, we propose a framework to achieve unsupervised ASR on a read
English speech dataset, where audio and text are unaligned. In the first stage,
each word-level audio segment in the utterances is represented by a vector
extracted by a sequence-to-sequence autoencoder, in which
phonetic information and speaker information are disentangled.
In the second stage, semantic embeddings of audio segments are trained from
the vector representations using a skip-gram model. Finally, an
unsupervised method is used to transform the semantic embeddings of audio
segments into the text embedding space, and the transformed embeddings are
mapped to words.
With the above framework, we take a step towards unsupervised ASR trained on
unaligned text and speech only.
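
The skip-gram stage mentioned above can be illustrated by how its training pairs are formed. The sketch below assumes that each word-level audio segment has already been mapped to a discrete ID (for instance by clustering the segment vectors); the abstract does not say how the vectors are fed to the skip-gram model, so this discretization, the function name, and the `window` parameter are all assumptions.

```python
def skipgram_pairs(sequence, window=2):
    """Generate (center, context) training pairs for a skip-gram model.

    sequence: list of segment IDs for one utterance, in spoken order.
    window:   number of neighbors considered on each side of the center.
    """
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a segment is not its own context
                pairs.append((center, sequence[j]))
    return pairs
```

A skip-gram model trained on such pairs places segments that occur in similar contexts close together, which is what makes the later mapping into the text embedding space plausible.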

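This paper studies unsupervised speech recognition and proposes a framework built from audio vector representations, semantic embeddings, and an unsupervised transformation. The framework can be applied to low-resource languages that lack audio-text aligned data and for which supervised methods are not applicable.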