Recent synthetic speech detectors that leverage the Transformer model outperform their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

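Summary: Building on the Transformer model, this work introduces a Temporal-Channel Modeling (TCM) module that strengthens the ability of multi-head self-attention (MHSA) to capture temporal-channel dependencies, improving synthetic speech detection. Experiments on the ASVspoof 2021 dataset show that the TCM module, with only 0.03M additional parameters, outperforms the current state-of-the-art system by 9.25% in equal error rate (EER). A further ablation study shows that using both temporal and channel information together is most effective for detecting synthetic speech.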
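
The results above are reported in EER, the operating point at which the false acceptance rate (spoofed speech accepted) equals the false rejection rate (bona fide speech rejected). The following is a minimal sketch of how such a value could be computed from detector scores; it is not the authors' evaluation code. The function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate from two sets of detector scores.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (hypothetical): one bona fide utterance scores below one spoof.
eer, threshold = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.2, 0.3])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On a discrete score set the two error rates may never be exactly equal, which is why the sketch averages them at the closest threshold; evaluation toolkits often interpolate instead.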

Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection