Using a vision-inspired keyword spotting framework, we propose an architecture with input-dependent dynamic depth capable of processing streaming audio. Specifically, we extend a conformer encoder with trainable binary gates that allow us to dynamically skip network modules according to the input audio. Our approach improves detection and localization accuracy on continuous speech, using the 1000 most frequent words in LibriSpeech, while maintaining a small memory footprint. The inclusion of gates also reduces the average amount of processing without affecting overall performance. These benefits are even more pronounced on the Google Speech Commands dataset placed over background noise, where up to 97% of the processing is skipped on non-speech inputs, making our method particularly interesting for an always-on keyword spotter.
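
The gating idea can be illustrated with a minimal inference-time sketch. It is not the implementation evaluated here: the linear gate, its fixed weights, the residual form, and the names `hard_gate`, `gated_residual`, and `run_encoder` are all assumptions made for illustration, and the way the gates are trained is not covered. Each module is wrapped in a residual connection and is executed only when its binary gate opens for the current frame, so the effective depth depends on the input.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Module = Callable[[Vector], Vector]
GateParams = Tuple[Sequence[float], float]  # (weights, bias), hypothetical


def hard_gate(frame: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Binary decision: 1 executes the module, 0 skips it.

    Assumption: a linear score thresholded at zero. In a trained model the
    weights and bias would be learned; here they are fixed by hand.
    """
    score = sum(w * x for w, x in zip(weights, frame)) + bias
    return 1 if score > 0.0 else 0


def gated_residual(
    frame: Vector, module: Module, weights: Sequence[float], bias: float
) -> Tuple[Vector, bool]:
    """Apply `module` with a residual connection only if the gate is open."""
    if hard_gate(frame, weights, bias) == 0:
        # Gate closed: the module is never called, so its cost is saved.
        return list(frame), False
    out = module(list(frame))
    return [x + y for x, y in zip(frame, out)], True


def run_encoder(
    frames: Sequence[Vector],
    modules: Sequence[Module],
    gate_params: Sequence[GateParams],
) -> Tuple[List[Vector], float]:
    """Process frames one by one and report the fraction of skipped modules."""
    outputs: List[Vector] = []
    executed = 0
    for frame in frames:
        h = list(frame)
        for module, (weights, bias) in zip(modules, gate_params):
            h, ran = gated_residual(h, module, weights, bias)
            executed += int(ran)
        outputs.append(h)
    total = len(frames) * len(modules)
    skipped_fraction = 1.0 - executed / total if total else 0.0
    return outputs, skipped_fraction


if __name__ == "__main__":
    # Toy stand-ins for encoder modules: each one doubles its input.
    toy_modules = [lambda v: [2.0 * x for x in v]] * 2
    toy_gates = [([1.0, 1.0], -0.5)] * 2
    silence, speech = [0.0, 0.0], [1.0, 1.0]
    outputs, skipped = run_encoder([silence, speech], toy_modules, toy_gates)
    print(outputs, skipped)
```

In this toy setting the all-zero frame closes both gates and passes through unchanged, while the non-zero frame runs through both modules. The skipped fraction is the quantity that the 97% figure for non-speech inputs refers to, although the numbers produced by this sketch carry no empirical meaning.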

Improving vision-inspired keyword spotting using dynamic module skipping in streaming conformer encoder