Recent advances in deep learning have facilitated the design of speaker verification systems that take raw waveforms directly as input. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the processing pipeline while demonstrating competitive performance. In this study, we improve RawNet by rescaling feature maps using various methods. The proposed mechanism uses a filter-wise rescale map: a vector whose dimensionality equals the number of filters in a given feature map and whose values are produced by a sigmoid non-linear function. Using this filter-wise rescale map, we propose to rescale the feature map multiplicatively, additively, or both. In addition, we investigate replacing the first convolution layer with the sinc-convolution layer of SincNet. Experiments on the VoxCeleb1 evaluation dataset demonstrate that the proposed methods are effective, and the best-performing system halves the equal error rate relative to the original RawNet. In expanded evaluations using the VoxCeleb1-E and VoxCeleb1-H protocols, the proposed system marginally outperforms existing state-of-the-art systems.
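
The three rescaling variants named above (multiplicative, additive, or both) can be illustrated with a minimal NumPy sketch. The abstract does not specify how the rescale map is computed, so several details here are assumptions made only for illustration: the map is derived by averaging the feature map over time and passing the result through a single fully connected layer and a sigmoid, the combined variant adds before multiplying, and the names `rescale_map`, `scale_feature_map`, `weight`, and `bias` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def rescale_map(feature_map, weight, bias):
    """Return a filter-wise rescale map of shape (filters,).

    feature_map has shape (time, filters). Assumption: the map comes from
    average pooling over time, one fully connected layer, then a sigmoid.
    """
    pooled = feature_map.mean(axis=0)      # (filters,)
    return sigmoid(pooled @ weight + bias)  # (filters,), values in (0, 1)


def scale_feature_map(feature_map, s, mode="mul"):
    """Rescale each filter of feature_map with the vector s.

    s is broadcast along the time axis, so every time step of a given
    filter is scaled by the same value.
    """
    if mode == "mul":    # multiplicative rescaling
        return feature_map * s
    if mode == "add":    # additive rescaling
        return feature_map + s
    if mode == "both":   # assumption: add first, then multiply
        return (feature_map + s) * s
    raise ValueError(f"unknown mode: {mode}")


# Toy example: 100 time steps, 8 filters, random (untrained) parameters.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((100, 8))
weight = rng.standard_normal((8, 8)) * 0.1
bias = np.zeros(8)

s = rescale_map(feature_map, weight, bias)
scaled = scale_feature_map(feature_map, s, mode="both")
```

Because the rescale map has one entry per filter, broadcasting applies the same scale to all time steps of that filter, which is what makes the operation filter-wise rather than element-wise.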

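This study proposes mechanisms for rescaling feature maps in several ways, including multiplicative and additive rescaling with a scale vector obtained through a sigmoid non-linear function, and replaces the first convolution layer with the sinc-convolution layer of SincNet. Experimental results show that the methods are effective: the best-performing system halves the equal error rate relative to the original RawNet, and expanded evaluations under the VoxCeleb1-E and VoxCeleb1-H protocols outperform existing state-of-the-art systems.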

Improved RawNet with Feature Map Scaling for Text-Independent Speaker Verification Using Raw Waveforms