Cued Speech (CS) is a multi-modal visual coding system that combines lip reading with several hand cues at the phonetic level to make spoken language visible to the hearing impaired. Previous studies addressed the asynchrony between lip and hand movements with a cuer\footnote{A person who performs Cued Speech is called a cuer.}-dependent piecewise linear model for English and French CS. In this work, we propose three statistical measures on the lip stream to build an interpretable and generalizable model for predicting hand preceding time (HPT), which achieves cuer independence through proper normalization. We also build the first Mandarin CS corpus, comprising annotated videos from five speakers: three normal-hearing and two hearing-impaired individuals. Using this corpus, we show that the hand preceding phenomenon exists in Mandarin CS production, with significant differences between normal-hearing and hearing-impaired cuers. Extensive experiments demonstrate that our model outperforms the baseline and the previous state-of-the-art methods.

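This paper introduces Cued Speech (CS), a multi-modal visual coding system that combines lip reading with hand gestures, and on that basis proposes an interpretable and generalizable model that uses statistical measures to predict hand preceding time (HPT). In addition, videos of five speakers are annotated, showing that the hand preceding phenomenon exists in their CS production and demonstrating the effectiveness of the method.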

A Novel Interpretable and Generalizable Re-synchronization Model for Cued Speech Based on a Multi-Cuer Corpus