Code-switching is a prevalent linguistic phenomenon in multilingual
societies like India. Building speech-to-text models for code-switched speech
is challenging because datasets are scarce. In this work, we focus
on the problem of spoken translation (ST) of code-switched speech in Indian
languages to English text. We present COSTA, a new end-to-end model
architecture that scaffolds on pretrained automatic speech recognition (ASR)
and machine translation (MT) modules, which are more widely available for many
languages.
Speech and ASR text representations are fused using an aligned interleaving
scheme and are fed further as input to a pretrained MT module; the whole
pipeline is then trained end-to-end for spoken translation using synthetically
created ST data. We also release a new evaluation benchmark for code-switched
Bengali-English, Hindi-English, Marathi-English and Telugu-English speech to
English text. COSTA significantly outperforms many competitive cascaded and
end-to-end multimodal baselines by up to 3.5 BLEU points.
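
The abstract does not spell out the aligned interleaving scheme, so the following is only a minimal sketch of one plausible reading. It assumes that a frame-level alignment between speech and ASR tokens is already available, that both representations share one dimension, and that each token's frames are mean-pooled before being placed next to the token embedding. The function name `interleave_aligned`, the `spans` argument, and the pooling choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def interleave_aligned(speech, tokens, spans):
    """Interleave pooled speech vectors with their aligned token embeddings.

    speech: (T, d) frame-level speech representations.
    tokens: (N, d) ASR token embeddings (same d is an assumption).
    spans:  N (start, end) frame ranges, one per token (assumed given).
    Returns a (2N, d) array ordered speech_0, token_0, speech_1, token_1, ...
    """
    if len(tokens) != len(spans):
        raise ValueError("need exactly one frame span per token")
    out = np.empty((2 * len(tokens), speech.shape[1]), dtype=float)
    for i, (start, end) in enumerate(spans):
        # Mean-pool the frames aligned to token i (hypothetical pooling choice).
        out[2 * i] = speech[start:end].mean(axis=0)
        out[2 * i + 1] = tokens[i]
    return out

# Toy usage: 4 speech frames aligned to 2 tokens.
speech = np.arange(8, dtype=float).reshape(4, 2)
tokens = np.array([[10.0, 10.0], [20.0, 20.0]])
fused = interleave_aligned(speech, tokens, [(0, 2), (2, 4)])
```

In a pipeline like the one described, a fused sequence of this kind would be what the pretrained MT module receives as input.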

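By combining pretrained automatic speech recognition (ASR) and machine translation (MT) modules, this work proposes COSTA, an end-to-end model architecture for translating code-switched speech in Indian languages into English text, and releases an accompanying evaluation benchmark. On code-switched Bengali-English, Hindi-English, Marathi-English, and Telugu-English speech, COSTA outperforms competitive baselines by up to 3.5 BLEU points.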

CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving

Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown
high-quality results, but the key challenge of how to semantically align two
embeddings for multi-word keywords of different sequence lengths remains
largely unsolved. In this paper, we propose an audio-text-based end-to-end
model architecture for flexible KWS that builds upon learned audio and text
embeddings. Our architecture uses a novel dynamic programming-based algorithm,
Dynamic Sequence Partitioning (DSP), to optimally partition the audio sequence
so that it has the same length as the word-based text sequence, exploiting the
monotonic alignment of spoken content. Our proposed model
consists of an encoder block to get audio and text embeddings, a projector
block to project individual embeddings to a common latent space, and an
audio-text aligner that uses the DSP algorithm to align the audio and text
embeddings and determine whether the spoken content matches the text.
Experimental results show that our DSP is more effective than other
partitioning schemes, and that the proposed architecture outperforms the
state-of-the-art results on the public dataset in terms of Area Under the ROC
Curve (AUC) and Equal-Error-Rate (EER) by 14.4% and 28.9%, respectively.
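
The abstract names DSP but gives neither its objective nor its recurrence, so the sketch below is a generic monotonic-partitioning dynamic program rather than the authors' algorithm. It assumes the goal is to split T audio frames into N contiguous segments, one per word, so as to maximize the summed dot product between each segment's mean-pooled embedding and the corresponding word embedding. The function name `partition_audio` and this scoring rule are hypothetical.

```python
import numpy as np

def partition_audio(audio, text):
    """Split T audio frames into N contiguous segments, one per word.

    audio: (T, d) frame embeddings; text: (N, d) word embeddings.
    Assumed objective (not from the source): maximize the sum over words of
    dot(mean of the segment's frames, word embedding).
    Returns a list of N (start, end) frame ranges covering [0, T).
    """
    T, N = len(audio), len(text)
    if N == 0 or T < N:
        raise ValueError("need at least one audio frame per word")
    # prefix[t] holds the sum of the first t frames, so segment means are O(d).
    prefix = np.vstack([np.zeros(audio.shape[1]), np.cumsum(audio, axis=0)])

    def score(s, t, n):
        return float((prefix[t] - prefix[s]) / (t - s) @ text[n])

    # best[n, t]: best score for covering the first t frames with n segments.
    best = np.full((N + 1, T + 1), -np.inf)
    back = np.zeros((N + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for n in range(1, N + 1):
        # Leave at least one frame for each of the remaining N - n words.
        for t in range(n, T - (N - n) + 1):
            for s in range(n - 1, t):
                cand = best[n - 1, s] + score(s, t, n - 1)
                if cand > best[n, t]:
                    best[n, t] = cand
                    back[n, t] = s
    # Walk the back-pointers from the final cell to recover the boundaries.
    bounds = [T]
    for n in range(N, 0, -1):
        bounds.append(int(back[n, bounds[-1]]))
    bounds.reverse()
    return list(zip(bounds[:-1], bounds[1:]))

# Toy usage: two frames resemble word 0, three frames resemble word 1.
audio = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
text = np.array([[1.0, 0.0], [0.0, 1.0]])
segments = partition_audio(audio, text)
```

Because segments are contiguous and ordered, the partition is monotonic by construction. This version runs in O(N·T²) time, which is adequate for a sketch but not tuned for long utterances.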

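This paper proposes an end-to-end model architecture built on audio and text embeddings. Its Dynamic Sequence Partitioning (DSP) method uses dynamic programming to partition the audio sequence to the same length as the word-based text sequence, thereby aligning audio and text. Experimental results show that the model improves on the prior state of the art by 14.4% in area under the ROC curve (AUC) and by 28.9% in equal error rate (EER).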