Automatic speech recognition (ASR) has been extensively researched and widely deployed, so the robustness of ASR models to minor input perturbations is crucial to their effectiveness in real-time scenarios. Previous studies of ASR robustness have predominantly evaluated accuracy in white-box settings, which assume full access to the ASR model. In real-world applications, however, full model details are often unavailable. Evaluating the robustness of black-box ASR models is therefore essential for a comprehensive understanding of ASR model resilience. To this end, we thoroughly study the vulnerability of cutting-edge ASR models to practical black-box attacks and propose to employ two advanced time-domain transferable attacks alongside our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription while keeping the perturbation largely imperceptible to human listeners, using a voice activity detection rule and a speech-aware gradient-oriented optimizer. Our comprehensive experimental results show performance improvements over baseline approaches across five models on two databases.

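This work addresses the insufficient robustness of automatic speech recognition (ASR) models to input perturbations in real-world applications. We propose an approach that combines time-domain transferable attacks with speech-aware gradient optimization (SAGO) to expose the vulnerability of black-box ASR models. Experimental results show that our method outperforms baseline approaches across five models on two databases.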

Transferable adversarial attacks against automatic speech recognition