Combining end-to-end speech translation (ST) and non-autoregressive (NAR)
generation is promising in language and speech processing because it offers
less error propagation and low latency. In this paper, we investigate the
potential of connectionist temporal classification (CTC) for non-autoregressive
speech translation (NAST). In particular, we develop a model consisting of two
encoders that are guided by CTC to predict the source and target texts,
respectively. Introducing CTC into NAST on both language sides poses two clear
challenges: 1) conditionally independent generation weakens the
interdependency among tokens, and 2) the monotonic alignment assumption in
standard CTC does not hold in translation tasks. In response, we develop a
prediction-aware encoding approach and a cross-layer attention approach to
address these issues. We also use curriculum learning to improve convergence of
training. Experiments on the MuST-C ST benchmarks show that our NAST model
achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$, which
is comparable to the autoregressive counterpart and even outperforms the
previous best result by 0.9 BLEU points.
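
The two challenges named above follow from how standard CTC produces its output. The sketch below is a minimal greedy CTC decoder, not the paper's model: the blank index, the function name `ctc_greedy_decode`, and the toy scores are all assumptions made for illustration. Each frame picks its label without looking at its neighbours (conditional independence), and the surviving labels keep the order of the frames (monotonic alignment).

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Choose the best label for each frame on its own, then collapse the path."""
    # Conditional independence: every frame is decided without its neighbours.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    merged = [label for label, _ in groupby(best_path)]
    return [label for label in merged if label != blank]


# Toy scores for 5 frames over 3 labels (label 0 is the blank).
scores = [
    [0.10, 0.80, 0.10],  # best label: 1
    [0.20, 0.70, 0.10],  # best label: 1 (repeat, merged with the frame before)
    [0.90, 0.05, 0.05],  # best label: 0 (blank, separates the two 1s)
    [0.10, 0.60, 0.30],  # best label: 1
    [0.10, 0.20, 0.70],  # best label: 2
]
print(ctc_greedy_decode(scores))  # [1, 1, 2]
```

Because the output order always follows the frame order, a decoder of this kind cannot reorder words on its own, which is why translation breaks the monotonic assumption.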

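This paper presents a CTC-based non-autoregressive speech translation model that uses prediction-aware encoding and cross-layer attention to address conditionally independent generation and the monotonic alignment assumption in translation. On the MuST-C ST benchmarks it reaches an average BLEU score of 29.5 with a 5.67$\times$ speed-up, comparable to the autoregressive counterpart and better than the previous best result.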

CTC-based Non-autoregressive Speech Translation

Growing evidence shows that strengthening layer interactions can enhance the
representation power of a deep neural network, while self-attention excels at
learning interdependencies by retrieving query-activated information.
Motivated by this, we devise a cross-layer attention mechanism, called
multi-head recurrent layer attention (MRLA), that sends a query representation
of the current layer to all previous layers to retrieve query-related
information from different levels of receptive fields. A lightweight version
of MRLA is also proposed to reduce the quadratic computation cost. The proposed
layer attention mechanism can enrich the representation power of many
state-of-the-art vision networks, including CNNs and vision transformers. Its
effectiveness has been extensively evaluated in image classification, object
detection and instance segmentation tasks, where improvements can be
consistently observed. For example, MRLA improves Top-1 accuracy on ResNet-50
by 1.6% while introducing only 0.16M parameters and 0.07B FLOPs.
Surprisingly, it boosts box AP and mask AP by a large margin of 3-4% in dense
prediction tasks. Our code is available at
this https URL
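
The idea of sending a query from the current layer to all previous layers can be sketched as plain scaled dot-product attention over layers. This is only an illustration under simplifying assumptions, not the MRLA implementation: it uses a single head, one feature vector per layer, no learned projections and no recurrence, and the function name `layer_attention` and the toy vectors are hypothetical.

```python
import math


def layer_attention(query, keys, values):
    """Attend from the current layer's query over one key/value per earlier layer."""
    scale = math.sqrt(len(query))
    # One similarity score per previous layer.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the layer axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the previous layers' values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy example: the current layer queries three earlier layers.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = layer_attention(query, keys, values)
print(weights)  # the first layer, whose key matches the query, gets the largest weight
```

If every one of the layers in a network attends to all of its predecessors this way, the number of query-key comparisons grows quadratically with depth, which is the cost the lightweight version is meant to reduce.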

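This paper proposes multi-head recurrent layer attention (MRLA), a cross-layer mechanism that retrieves query-related information from different levels of receptive fields to enrich the representation power of many vision networks, yielding consistent improvements in image classification, object detection and instance segmentation.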