This paper targets the perceptual task of separating the different
interacting voices, i.e., monophonic melodic streams, in a polyphonic musical
piece. We target symbolic music, where notes are explicitly encoded, and model
this task as a Multi-Trajectory Tracking (MTT) problem from discrete
observations, i.e., notes in a pitch-time space. Our approach builds a graph
from a musical piece, by creating one node for every note, and separates the
melodic trajectories by predicting a link between two notes if they are
consecutive in the same voice/stream. This kind of local, greedy prediction is
made possible by node embeddings created by a heterogeneous graph neural
network that can capture inter- and intra-trajectory information. Furthermore,
we propose a new regularization loss that encourages the output to respect the
MTT premise of at most one incoming and one outgoing link for every node,
favouring monophonic (voice) trajectories; this loss function might also be
useful in other general MTT scenarios. Our approach does not use
domain-specific heuristics, is scalable to longer sequences and a higher number
of voices, and can handle complex cases such as voice inversions and overlaps.
We reach new state-of-the-art results for the voice separation task in
classical music of different styles.
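
The one-in, one-out constraint described above can be illustrated with a small sketch. It assumes a hinge-style penalty on each note's summed incoming and outgoing link probabilities, followed by a greedy decoding pass. Both the formulation and the names `degree_penalty`, `greedy_links`, and `example_probs` are hypothetical and chosen for illustration; the paper's actual loss and decoding may differ.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (earlier note index, later note index)


def degree_penalty(probs: Dict[Edge, float], num_notes: int) -> float:
    """Penalize notes whose summed outgoing or incoming link probability exceeds 1.

    Assumed hinge-style formulation, not the paper's exact loss.
    """
    out_sum = [0.0] * num_notes
    in_sum = [0.0] * num_notes
    for (u, v), p in probs.items():
        out_sum[u] += p
        in_sum[v] += p
    excess = [max(0.0, s - 1.0) for s in out_sum + in_sum]
    return sum(excess) / (2 * num_notes)


def greedy_links(probs: Dict[Edge, float], threshold: float = 0.5) -> List[Edge]:
    """Keep the most confident links first, with at most one incoming
    and one outgoing link per note (assumed decoding strategy)."""
    used_out, used_in = set(), set()
    kept = []
    for (u, v), p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        if p < threshold:
            break  # remaining candidates are even less confident
        if u in used_out or v in used_in:
            continue  # would give a note a second outgoing or incoming link
        used_out.add(u)
        used_in.add(v)
        kept.append((u, v))
    return sorted(kept)


# Made-up link probabilities for four notes, for illustration only.
example_probs = {(0, 1): 0.875, (0, 2): 0.75, (1, 2): 0.625, (2, 3): 0.5, (1, 3): 0.25}
penalty = degree_penalty(example_probs, num_notes=4)
links = greedy_links(example_probs)
```

In this toy input, note 0 has two strong outgoing candidates and note 2 has two strong incoming ones, so the penalty is nonzero; the decoding step keeps only one link per direction for each note, which yields monophonic chains.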

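This paper uses a graph neural network to model voice separation as a multi-trajectory tracking problem over discrete observations (notes), and with a new regularization loss it achieves state-of-the-art separation results.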

Musical Voice Separation as Link Prediction: Modeling a Musical Perception Task as a Multi-Trajectory Tracking Problem

This paper presents an audio-visual approach for voice separation that
produces state-of-the-art results at low latency in two scenarios: speech and
singing voice. The model is based on a two-stage network. In the first stage,
motion cues are obtained with a lightweight graph convolutional network that
processes face landmarks; both audio and motion features are then fed to an
audio-visual transformer, which produces a fairly good estimate of the isolated
target source. In the second stage, the predominant voice is enhanced with an
audio-only network. We present several ablation studies and a comparison with
state-of-the-art methods. Finally, we explore the transferability of models
trained for speech separation to the task of singing voice separation. The
demos, code, and weights are available at this https URL

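This paper proposes an audio-visual voice separation approach that achieves state-of-the-art results at low latency in two scenarios, speech and singing voice. The model is a two-stage network: a lightweight graph convolutional network extracts motion cues from face landmarks, and the audio and motion features are then fed to an audio-visual transformer that produces a fairly good estimate of the isolated target source. In the second stage, an audio-only network enhances the predominant voice. The authors report ablation studies and comparisons with state-of-the-art methods, and they explore how well models trained for speech separation transfer to singing voice separation.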

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

When video is shot in a noisy environment, the voice of a speaker seen in the
video can be enhanced using the visible mouth movements, reducing background
noise. While most existing methods use audio-only inputs, improved performance
is obtained with our visual speech enhancement, based on an audio-visual neural
network. We include in the training data videos to which we added the voice of
the target speaker as background noise. Since the audio input is not sufficient
to separate the voice of a speaker from his own voice, the trained model better
exploits the visual input and generalizes well to different noise types. The
proposed model outperforms prior audio-visual methods on two public lipreading
datasets. It is also the first to be demonstrated on a dataset not designed for
lipreading, such as the weekly addresses of Barack Obama.
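
The training-data construction described above, in which the target speaker's own voice serves as background noise, might be sketched as follows. This is a minimal illustration on raw sample lists; the function name `mix_with_same_speaker`, the `noise_gain` parameter, and the looping of the noise clip are assumptions rather than details taken from the paper.

```python
from typing import List, Sequence


def mix_with_same_speaker(target: Sequence[float],
                          other_utterance: Sequence[float],
                          noise_gain: float = 1.0) -> List[float]:
    """Add a second utterance by the same speaker to the target as noise.

    The noise clip is looped to cover the target length (an assumption).
    """
    if not other_utterance:
        raise ValueError("other_utterance must be non-empty")
    n = len(other_utterance)
    return [t + noise_gain * other_utterance[i % n] for i, t in enumerate(target)]


# Toy samples, for illustration only.
noisy = mix_with_same_speaker([0.5, -0.5, 0.25], [0.25, 0.5], noise_gain=0.5)
```

Because the interfering signal shares the target's vocal characteristics, audio cues alone cannot tell the two apart, which is the stated reason the trained model leans more on the visual input.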

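This study uses a visual speech enhancement method based on an audio-visual neural network. By training on videos in which the target speaker's own voice is added as background noise, the model learns to rely on mouth movements to enhance the speaker's voice and reduce noise in noisy environments. It outperforms prior audio-visual methods on two public lipreading datasets and is the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama.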