Recognizing a speaker's level of commitment to a belief is a difficult task;
humans not only interpret the meaning of the words in context, but also
understand cues from intonation and other aspects of the audio signal. Many
papers and corpora in the NLP community have approached the belief prediction
task using text-only approaches. We are the first to frame and present results
on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP),
containing aligned text and audio with speaker belief annotations. We first
report baselines and significant features using acoustic-prosodic features and
traditional machine learning methods. We then present text and audio baselines
for the CBP corpus, obtained by fine-tuning BERT and Whisper, respectively.
Finally, we present our multimodal architecture, which fine-tunes BERT and
Whisper and uses multiple fusion methods, improving on either modality alone.

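Summary: Recognizing a speaker's level of commitment to a belief is a difficult task. We are the first to frame the multimodal belief prediction task and present results on it, using the CB-Prosody corpus (CBP), which contains aligned text and audio with speaker belief annotations. We report baselines and significant features using acoustic-prosodic features and traditional machine learning methods. We also present text and audio baselines for the CBP corpus, obtained by fine-tuning BERT and Whisper, respectively. Finally, we present a multimodal architecture that fine-tunes BERT and Whisper with multiple fusion methods and improves on either modality alone.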
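
The abstract mentions "multiple fusion methods" without naming them. As a rough illustration only, the sketch below shows two commonly used variants, early fusion (concatenating embeddings) and late fusion (averaging per-modality predictions). Everything in it is an assumption for illustration: the function names, the toy embeddings standing in for pooled BERT and Whisper outputs, and the hand-set weights are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def early_fusion(text_vec: Sequence[float], audio_vec: Sequence[float]) -> List[float]:
    """Early fusion: concatenate the two modality embeddings into one vector."""
    return list(text_vec) + list(audio_vec)


def linear_head(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Toy regression head: a dot product plus a bias term."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def late_fusion(text_score: float, audio_score: float, text_weight: float = 0.5) -> float:
    """Late fusion: weighted average of the per-modality predictions."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return text_weight * text_score + (1.0 - text_weight) * audio_score


# Hypothetical pooled embeddings (stand-ins for BERT and Whisper encoder outputs).
text_embedding = [0.5, -1.0]
audio_embedding = [2.0, 0.0, 1.0]

# Early fusion: one head sees both modalities at once.
fused = early_fusion(text_embedding, audio_embedding)
early_score = linear_head(fused, weights=[1.0, 1.0, 0.5, 0.0, -1.0], bias=0.0)

# Late fusion: each modality predicts separately, then the scores are combined.
late_score = late_fusion(text_score=2.0, audio_score=1.0, text_weight=0.75)

print(fused, early_score, late_score)
```

In a real system the weights would be learned jointly with the fine-tuned encoders; the hand-set values here only make the data flow easy to follow.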

Multimodal Belief Prediction

In conversational speech, the acoustic signal provides cues that help
listeners disambiguate difficult parses. For automatically parsing spoken
utterances, we introduce a model that integrates transcribed text and
acoustic-prosodic features using a convolutional neural network over energy and
pitch trajectories coupled with an attention-based recurrent neural network
that accepts text and prosodic features. We find that different types of
acoustic-prosodic features are individually helpful, and together give
statistically significant improvements in parse and disfluency detection F1
scores over a strong text-only baseline. For this study with known sentence
boundaries, error analyses show that the main benefit of acoustic-prosodic
features is in sentences with disfluencies, that attachment decisions are most
improved, and that transcription errors obscure gains from prosody.

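Summary: In this paper, we propose a model for automatically parsing spoken utterances that integrates transcribed text and acoustic-prosodic features, using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features each help parsing, and that the model improves significantly over a strong text-only baseline. Error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, that attachment decisions improve the most, and that transcription errors obscure the gains from prosody.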
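
To make the idea of a convolution over energy and pitch trajectories more concrete, here is a minimal sketch of a 1-D convolution followed by ReLU and max-pooling over time. It is an illustration under stated assumptions, not the authors' model: the trajectories are invented, the two filters are hand-set edge detectors rather than learned, and the function names are hypothetical.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid (unpadded) 1-D cross-correlation, the operation CNN layers apply."""
    k = len(kernel)
    if k == 0 or k > len(signal):
        raise ValueError("kernel must be non-empty and no longer than the signal")
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values: Sequence[float]) -> List[float]:
    """Element-wise rectifier: negative responses are clipped to zero."""
    return [max(0.0, v) for v in values]


def prosody_features(
    pitch: Sequence[float],
    energy: Sequence[float],
    kernels: Sequence[Sequence[float]],
) -> List[float]:
    """Apply every filter to each trajectory and keep the peak response over time."""
    features = []
    for trajectory in (pitch, energy):
        for kernel in kernels:
            features.append(max(relu(conv1d(trajectory, kernel))))
    return features


# Hypothetical frame-level trajectories (invented, already normalized).
pitch = [0.0, 1.0, 3.0, 2.0, 1.5]
energy = [1.0, 1.0, 2.0, 4.0, 4.0]

# Hand-set filters (an assumption): a rise detector and a fall detector.
kernels = [[-1.0, 1.0], [1.0, -1.0]]

features = prosody_features(pitch, energy, kernels)
print(features)  # [2.0, 1.0, 2.0, 0.0]
```

The resulting fixed-length vector is the kind of prosodic summary that could be passed, alongside word embeddings, to a downstream sequence model; a trained network would learn its filters from data instead of using fixed ones.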