Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

May, 2018

Soumil Mandal, Karthick Nanmaran

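TL;DR: This paper presents a novel architecture focused on normalizing phonetic typing variations, which in some cases can also be used for back-transliteration and word identification. It achieves 90.27% accuracy on the test data and addresses the grammatical inconsistencies and spelling variations found in code-mixed text, on top of the known difficulties of social media scenarios.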

Abstract

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is rising exponentially on social media. Working with →