We perform automatic paraphrase detection on subtitle data from the
Opusparcus corpus comprising six European languages: German, English, Finnish,
French, Russian, and Swedish. We train two types of supervised sentence
embedding models: a word-averaging (WA) model and a gated recurrent averaging
network (GRAN) model. We find that GRAN outperforms WA and is more robust
to noisy training data. Better results are obtained with more and noisier data
than with less and cleaner data. Additionally, we experiment on other datasets
but do not reach the same level of performance, because of domain mismatch
between training and test data.
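
The word-averaging idea can be illustrated with a minimal sketch: a sentence embedding is the mean of its word vectors, and two sentences are scored by cosine similarity. The vectors in `TOY_VECTORS` and the function names are invented for illustration. In the work described above the embeddings are trained with supervision, which this sketch does not attempt.

```python
import math

# Toy 2-dimensional word vectors, invented for illustration only.
TOY_VECTORS = {
    "hi": [1.0, 0.0],
    "hello": [0.9, 0.1],
    "there": [0.5, 0.5],
    "goodbye": [0.0, 1.0],
}


def average_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def paraphrase_score(a, b, vectors=TOY_VECTORS, dim=2):
    """Score two sentences by the cosine of their averaged word vectors."""
    emb_a = average_embedding(a.lower().split(), vectors, dim)
    emb_b = average_embedding(b.lower().split(), vectors, dim)
    return cosine(emb_a, emb_b)
```

With these toy vectors, `paraphrase_score("hi there", "hello there")` is higher than `paraphrase_score("hi there", "goodbye")`. A detector would then apply a threshold to the score; how that threshold is chosen is not covered here.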

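This study performs automatic paraphrase detection with two types of trained models. It finds that the GRAN model outperforms the WA model and is more robust to noise, which makes it better suited to larger, noisier data. Experiments on other datasets did not reach the same level of performance because of domain mismatch between training and test data.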

Paraphrase Detection on Noisy Subtitles in Six Languages

In this paper, a novel approach is proposed to automatically construct a
parallel discourse corpus for dialogue machine translation. First, the
parallel subtitle data and the corresponding monolingual movie script data are
crawled and collected from the Internet. Then tags such as speaker and discourse
boundary are projected from the script data onto the subtitle data via an
information retrieval approach in order to map monolingual discourse to
bilingual texts. We not only evaluate the mapping results, but also integrate
speaker information into the translation. Experiments show that our proposed
method achieves 81.79% and 98.64% accuracy on speaker and dialogue boundary
annotation, respectively, and that speaker-based language model adaptation
yields an improvement of around 0.5 BLEU points in translation quality.
Finally, we publicly release around 100K parallel discourse data with manual
speaker and dialogue boundary annotation.
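
The projection step can be illustrated with a simple retrieval-style sketch: each subtitle line is matched to the most similar script utterance, and that utterance's speaker tag is copied over. Jaccard token overlap, the `threshold` value, and all function names here are assumptions made for illustration. The abstract does not specify which retrieval model or similarity measure is used.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def project_speakers(script, subtitles, threshold=0.5):
    """Copy speaker tags from script utterances onto subtitle lines.

    script: list of (speaker, utterance) pairs.
    subtitles: list of subtitle strings.
    Returns a list of (speaker or None, subtitle) pairs; None means no
    script utterance reached the (assumed) similarity threshold.
    """
    indexed = [(speaker, token_set(utt)) for speaker, utt in script]
    projected = []
    for line in subtitles:
        query = token_set(line)
        best_speaker, best_score = None, 0.0
        for speaker, utt_tokens in indexed:
            score = jaccard(query, utt_tokens)
            if score > best_score:
                best_speaker, best_score = speaker, score
        tag = best_speaker if best_score >= threshold else None
        projected.append((tag, line))
    return projected
```

For example, given the script `[("ANNA", "Where are you going tonight?"), ("BEN", "I am going home.")]`, the subtitle `"Where are you going?"` is tagged `ANNA`, while a line with no overlapping tokens is left untagged.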

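This paper proposes a new method for automatically constructing a parallel discourse corpus for dialogue machine translation. Experiments show that the method improves translation quality, and about 100K parallel discourse data with manually annotated speakers and dialogue boundaries are publicly released.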