Mapping two modalities, speech and text, into a shared representation space
is one approach to using text-only data to improve end-to-end automatic
speech recognition (ASR) performance in new domains. However, speech and text
representations differ in length. Although previous methods up-sample the
text representation to align with the acoustic modality, the up-sampled
length may not match the actual duration. In this paper, we propose a novel
representation matching strategy that down-samples the acoustic
representation to align with the text modality. By introducing a continuous
integrate-and-fire (CIF) module that generates acoustic representations
consistent with the token length, our ASR model can better learn unified
representations from both modalities, allowing for domain adaptation using
text-only data from the target domain. Experimental results on new-domain
data demonstrate the effectiveness of the proposed method.
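
The abstract does not describe the internals of the CIF module, so the following is only a minimal sketch of the integrate-and-fire idea as it is commonly described: each acoustic frame carries a weight, weights are accumulated, and one token-level vector is emitted whenever the accumulated weight reaches a threshold. The function name `cif_downsample`, the threshold of 1.0, and the hand-picked weights are assumptions for illustration; in a trained model the weights would be predicted by the network.

```python
def cif_downsample(frames, alphas, threshold=1.0):
    """Down-sample frame-level vectors to token-level vectors (illustrative sketch).

    frames: list of equal-length lists of floats (frame-level acoustic vectors)
    alphas: list of non-negative floats, one weight per frame (assumed given)
    """
    dim = len(frames[0])
    outputs = []             # emitted token-level vectors
    acc = 0.0                # accumulated weight since the last firing
    state = [0.0] * dim      # weighted sum of frames since the last firing

    for h, a in zip(frames, alphas):
        remaining = a
        # Fire as long as the current frame's weight pushes us to the threshold.
        while acc + remaining >= threshold:
            used = threshold - acc  # part of this frame that completes the token
            outputs.append([s + used * x for s, x in zip(state, h)])
            remaining -= used
            acc = 0.0
            state = [0.0] * dim
        # Whatever is left of this frame's weight starts the next token.
        acc += remaining
        state = [s + remaining * x for s, x in zip(state, h)]

    return outputs


# Six one-dimensional frames whose weights sum to 3, so three vectors are emitted.
frames = [[1.0], [3.0], [2.0], [6.0], [4.0], [0.0]]
alphas = [0.5, 0.5, 0.25, 0.75, 0.5, 0.5]
tokens = cif_downsample(frames, alphas)
print(len(frames), "frames ->", len(tokens), "token-level vectors")
```

Because the number of firings is governed by the sum of the weights, the output length can be made to follow the token count rather than the audio duration, which is the property the abstract relies on for matching the two modalities.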

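This work introduces a continuous integrate-and-fire (CIF) module to map speech and text into a shared representation space, with the aim of improving automatic speech recognition (ASR) performance in new domains. Because the CIF module produces acoustic representations whose length is consistent with the token sequence, the ASR model learns unified representations of both modalities, which allows domain adaptation using text-only data from the target domain. Experimental results show that the method is effective on new-domain data.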

Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

We learn rich natural sound representations by capitalizing on large amounts
of unlabeled sound data collected in the wild. We leverage the natural
synchronization between vision and sound to learn an acoustic representation
using two million unlabeled videos. Unlabeled video has the advantage that it
can be economically acquired at massive scales, yet contains useful signals
about natural sound. We propose a student-teacher training procedure which
transfers discriminative visual knowledge from well-established visual
recognition models into the sound modality using unlabeled video as a bridge.
Our sound representation yields significant performance improvements over the
state-of-the-art results on standard benchmarks for acoustic scene/object
classification. Visualizations suggest some high-level semantics automatically
emerge in the sound network, even though it is trained without ground-truth
labels.
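
The student-teacher transfer described above can be illustrated with a small sketch. The abstract does not state which training objective is used, so the KL divergence between the teacher's class distribution and the student's prediction is an assumption here, as are the function names and the toy numbers. The teacher stands for a visual recognition model applied to the video frames, and the student stands for the sound network applied to the audio of the same video.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_probs, student_logits):
    """Penalize the student (sound) for disagreeing with the teacher (vision).

    Assumed objective: KL divergence from the teacher distribution to the
    student's softmax output. No ground-truth label is involved.
    """
    return kl_divergence(teacher_probs, softmax(student_logits))


# Toy example: the visual teacher's distribution over three hypothetical classes.
teacher_probs = [0.5, 0.25, 0.25]

# A student that already agrees with the teacher, and one that predicts uniformly.
matched_logits = [math.log(p) for p in teacher_probs]
uniform_logits = [0.0, 0.0, 0.0]

loss_matched = distillation_loss(teacher_probs, matched_logits)
loss_uniform = distillation_loss(teacher_probs, uniform_logits)
print("matched:", loss_matched, "uniform:", loss_uniform)
```

The only supervision in this sketch is the teacher's output on the synchronized video, which is why unlabeled video can serve as the bridge between the two modalities.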

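Using large amounts of unlabeled sound data collected in the wild, we exploit the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos. We propose a student-teacher training procedure that transfers visual knowledge into the sound modality, yielding significant performance improvements on standard benchmarks for acoustic scene/object classification. Even without ground-truth labels, some high-level semantics emerge automatically in the sound network.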