This paper introduces a novel combination of two tasks that were previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two approaches, both of which operate speaker- and text-independently at inference time. We use a multi-task learning setup with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both approaches share these requirements, they differ in how they obtain phoneme-related predictions: one is based on frame classification, the other on a two-stage training procedure and forced alignment. We reach a competitive mean correlation of 0.73 on the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme forced aligner.
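The two reported figures, mean correlation for AAI and frame overlap for phoneme alignment, can be illustrated with a minimal sketch. The abstract does not define either metric precisely, so the code below assumes that mean correlation is the Pearson correlation computed per articulatory channel and averaged over channels, and that frame overlap is the fraction of frames on which two frame-level phoneme label sequences agree. The function names (`pearson`, `mean_channel_correlation`, `frame_overlap`) and the toy data are hypothetical and not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long trajectories."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("trajectories must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return cov / math.sqrt(var_x * var_y)


def mean_channel_correlation(predicted, reference):
    """Average the per-channel Pearson correlation.

    Assumption: each argument is a list of channels, and each channel is
    a list of per-frame values for one articulatory variable.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need the same non-zero number of channels")
    scores = [pearson(p, r) for p, r in zip(predicted, reference)]
    return sum(scores) / len(scores)


def frame_overlap(labels_a, labels_b):
    """Fraction of frames on which two phoneme label sequences agree.

    Assumption: both sequences hold one phoneme label per frame at the
    same frame rate.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must have equal non-zero length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy data: two articulatory channels over four frames.
reference = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]]
predicted = [[0.1, 1.1, 2.1, 3.1], [0.0, 1.0, 0.0, 1.0]]
toy_correlation = mean_channel_correlation(predicted, reference)

# Toy data: frame-level phoneme labels from two aligners over five frames.
aligner_a = ["s", "s", "ih", "ih", "t"]
aligner_b = ["s", "ih", "ih", "ih", "t"]
toy_overlap = frame_overlap(aligner_a, aligner_b)
```

In the toy data the first channel is tracked perfectly up to an offset and the second is inverted, so the two per-channel correlations cancel out. This shows why an averaged score such as the reported 0.73 can hide large differences between individual articulatory channels.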

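Summary: The paper introduces a new approach that combines two tasks, acoustic-to-articulatory speech inversion and phoneme-to-articulatory motion estimation, into a joint task called acoustic phoneme-to-articulatory speech inversion. Two approaches are explored, both speaker- and text-independent during inference. A multi-task learning setup is used with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. The two approaches differ in their phoneme-related predictions: one is based on frame classification, the other on a two-stage training procedure and forced alignment. On the acoustic-to-articulatory inversion task the method reaches a mean correlation of 0.73, and it achieves up to 87% frame overlap compared to an existing text-dependent phoneme forced aligner.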

Speaker- and text-independent estimation of articulatory movements and phoneme alignment from speech