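TL;DR: This paper proposes an unsupervised scoring function, based on cross-lingual sentence embeddings, that computes the semantic distance between documents in different languages. The score guides a document alignment algorithm in matching cross-lingual web documents and significantly improves alignment quality across different language pairs.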
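To make the idea concrete, here is a minimal sketch of scoring documents by embedding distance and then aligning them. It is an illustration under stated assumptions, not the paper's method: the distance is a simple stand-in (cosine distance between mean-pooled sentence embeddings), the matching is a plain greedy one-to-one pass, and the function names and toy 2-dimensional vectors are hypothetical. In practice the sentence vectors would come from a cross-lingual sentence encoder.

```python
import math


def mean_vector(sentence_vectors):
    """Average a document's sentence embeddings into one document vector."""
    dim = len(sentence_vectors[0])
    n = len(sentence_vectors)
    return [sum(v[i] for v in sentence_vectors) / n for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def doc_distance(doc_a, doc_b):
    """Stand-in document distance (an assumption, not the paper's score)."""
    return cosine_distance(mean_vector(doc_a), mean_vector(doc_b))


def greedy_align(src_docs, tgt_docs):
    """Greedy one-to-one matching: take the closest unmatched pair first."""
    scored = sorted(
        (doc_distance(s, t), i, j)
        for i, s in enumerate(src_docs)
        for j, t in enumerate(tgt_docs)
    )
    used_src, used_tgt, pairs = set(), set(), []
    for _, i, j in scored:
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(pairs)


# Toy data: each document is a list of sentence embeddings (hypothetical values).
src_docs = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
tgt_docs = [[[0.1, 1.0]], [[1.0, 0.05]]]

print(greedy_align(src_docs, tgt_docs))  # [(0, 1), (1, 0)]
```

The scoring function is unsupervised in the sense that it needs no labeled document pairs, only pre-computed embeddings; any other document-level distance could be substituted for `doc_distance` without changing the matching step.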
Abstract
Document alignment aims to identify pairs of documents in two distinct
languages that are of comparable content or are translations of each other. Such
aligned data can be used for a variety of NLP tasks from training cross-lingual
representations to mining parallel data for machine transl