We present a neural network for rendering binaural speech from monaural audio given the position and orientation of the source. Most previous work has focused on synthesizing binaural speech by conditioning on position and orientation in the feature space of convolutional neural networks. These synthesis approaches are powerful at estimating the target binaural speech, even for in-the-wild data, but they generalize poorly when rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space. Specifically, starting from a geometric time delay based on the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. By design, NFS is efficient in both memory and computational cost, is interpretable, and operates independently of the source domain. Experimental results show that NFS outperforms previous studies on the benchmark dataset while using up to 25 times less memory and 6 times fewer computations.
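
To make the idea of rendering "in the Fourier space" concrete, the following is a minimal sketch of the Fourier shift theorem that the description above relies on: a delay of d samples corresponds to multiplying each frequency bin by exp(-2πi·f·d), so a sum of delayed, scaled copies of a frame can be applied as a single frequency response. This is not the authors' implementation. The function names, the speed of sound (343 m/s), the 48 kHz sample rate, and the hand-picked delays and scales are all assumptions for illustration; in NFS the delays and scales would be predicted by the network.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def geometric_delay(source_pos, ear_pos, sample_rate):
    """Hypothetical helper: propagation delay in samples from source to ear."""
    src = np.asarray(source_pos, dtype=float)
    ear = np.asarray(ear_pos, dtype=float)
    distance = np.linalg.norm(src - ear)  # metres
    return distance / SPEED_OF_SOUND * sample_rate


def fourier_shift(frame, delays, scales):
    """Hypothetical helper: sum of delayed, scaled copies of `frame`.

    Each (delay, scale) pair contributes scale * exp(-2j*pi*f*delay) to the
    frequency response. Delays are in samples and may be fractional. The
    shift is circular within the frame, so real use would need padding.
    """
    n = len(frame)
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(n)  # normalized frequency, cycles per sample
    delays = np.asarray(delays, dtype=float)[:, None]
    scales = np.asarray(scales, dtype=float)[:, None]
    response = np.sum(scales * np.exp(-2j * np.pi * freqs[None, :] * delays), axis=0)
    return np.fft.irfft(spectrum * response, n=n)


# Usage: an impulse passed through two illustrative "reflections".
impulse = np.zeros(16)
impulse[0] = 1.0
rendered = fourier_shift(impulse, delays=[3.0, 5.0], scales=[0.5, 0.25])

# Geometric delay for a source 1 m from the ear at an assumed 48 kHz.
delay_samples = geometric_delay((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), 48000)
```

Because the delays enter only through the phase term, fractional delays need no explicit interpolation, and the operation does not depend on what the input audio contains.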

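This paper proposes NFS, a novel neural network architecture based on the Neural Fourier Shift that synthesizes binaural speech in the Fourier space by predicting the delays and scales of early reflections. The method is efficient in both memory and computational cost and operates independently of the source domain. Experimental results show that it outperforms comparable prior work in both performance and efficiency.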

Neural Fourier Shift for Binaural Rendering