Effective oral communication requires that words be pronounced clearly, and this matters especially for non-native speakers. Word stress is key to clear and correct English pronunciation, and misplaced syllable stress may lead to misunderstandings. Knowing the stress level of each syllable is therefore important for English speakers and learners. This paper presents a self-attention model that identifies the stress level of each syllable in spoken English. Various prosodic and categorical features are explored, including the pitch level, intensity, duration, and type of the syllable and of its nucleus (the vowel of the syllable). These features are input to the self-attention model, which predicts syllable-level stress. The simplest model achieves accuracies of over 88% and 93% on different datasets, while more advanced models achieve higher accuracy. Our study suggests that the self-attention model is promising for stress-level detection. These models could be applied in various scenarios, such as online meetings and English learning.

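This paper presents a self-attention model for identifying the stress level of each syllable in spoken English. It explores prosodic and categorical features, including pitch, intensity, duration, syllable type, and the nucleus (the vowel of the syllable), and feeds these features into the self-attention model to predict syllable-level stress. The study suggests that the self-attention model is promising for stress-level detection and can be applied in scenarios such as online meetings and English learning.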
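
To make the pipeline described in the abstract concrete, the following is a minimal sketch of a single self-attention layer that maps per-syllable feature vectors to stress-level probabilities. It is an illustration under stated assumptions, not the architecture used in the paper: the function names `init_params` and `self_attention_stress`, the five-feature layout, the integer coding of the categorical features, the hidden size of 8, the choice of three stress levels, and the feature values themselves are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(num_features, d_model=8, num_levels=3, seed=0):
    """Random (untrained) weights. d_model and num_levels are assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "Wq": rng.normal(scale=0.1, size=(num_features, d_model)),
        "Wk": rng.normal(scale=0.1, size=(num_features, d_model)),
        "Wv": rng.normal(scale=0.1, size=(num_features, d_model)),
        "Wo": rng.normal(scale=0.1, size=(d_model, num_levels)),
        "bo": np.zeros(num_levels),
    }


def self_attention_stress(features, params):
    """Predict a stress level for every syllable of one word or utterance.

    features: array of shape (num_syllables, num_features).
    Returns (labels, probs, weights), where weights[i, j] is how strongly
    syllable i attends to syllable j.
    """
    q = features @ params["Wq"]
    k = features @ params["Wk"]
    v = features @ params["Wv"]
    # Scaled dot-product attention over the syllables of the sequence.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    context = weights @ v
    # Per-syllable classification head over the assumed stress levels.
    logits = context @ params["Wo"] + params["bo"]
    probs = softmax(logits, axis=-1)
    return probs.argmax(axis=-1), probs, weights


# Hypothetical, normalized features for a three-syllable word. Columns:
# pitch, intensity, duration, syllable type (coded), nucleus type (coded).
syllables = np.array([
    [0.2, 0.1, -0.3, 1.0, 2.0],
    [1.1, 0.9, 0.8, 0.0, 1.0],
    [-0.7, -0.5, -0.2, 1.0, 0.0],
])

params = init_params(num_features=syllables.shape[1])
labels, probs, weights = self_attention_stress(syllables, params)
print(labels)  # one predicted stress-level index per syllable
```

Because the weights are untrained, the printed labels carry no meaning. The sketch only shows the data flow: each syllable's prediction depends on the features of every other syllable in the sequence through the attention weights, which fits the fact that stress is relative among the syllables of a word.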

Detecting Syllable-Level Pronunciation Stress with a Self-Attention Model