Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other individual languages, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixed Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build MUTANT, a first-of-its-kind multi-sentential code-mixed Hinglish dataset. We propose a token-level language-aware pipeline and extend the existing metrics that measure the degree of code-mixing to a multi-sentential framework, which lets us automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
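As an illustration of how a token-level degree-of-code-mixing score might be lifted to a multi-sentence span, the sketch below uses the Code-Mixing Index (CMI) of Gambäck and Das. This is an assumption: the abstract does not name the metrics it extends. The tag names (`hi`, `en`, `other`), the function names, and the per-sentence averaging are all hypothetical choices made for illustration, not the paper's formulation.

```python
from collections import Counter
from typing import Iterable, List


def code_mixing_index(tags: Iterable[str], neutral: str = "other") -> float:
    """CMI for one token sequence, given per-token language tags.

    CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of
    tokens and u the number of language-independent tokens (tag `neutral`).
    Returns 0.0 when every token is language-independent.
    """
    counts = Counter(tag for tag in tags if tag != neutral)
    total = sum(counts.values())  # this is n - u
    if total == 0:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / total)


def mean_sentence_cmi(sentences: List[List[str]], neutral: str = "other") -> float:
    """Hypothetical multi-sentence aggregate: the mean of per-sentence CMI.

    This is one simple way to score a multi-sentential span; it is an
    assumption, not the extension proposed in the paper.
    """
    if not sentences:
        return 0.0
    scores = [code_mixing_index(sentence, neutral) for sentence in sentences]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Tags would come from a token-level language identifier (not shown here).
    span = [
        ["hi", "en", "hi", "other"],  # mixed sentence
        ["en", "en", "en"],           # monolingual sentence
    ]
    print(mean_sentence_cmi(span))
```

A span could then be flagged as a candidate MCT when its score exceeds a chosen threshold; the threshold itself would have to be set empirically.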

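This paper proposes a novel task of identifying multi-sentential code-mixed text (MCT). It develops a token-level language-aware pipeline, extends existing metrics for the degree of code-mixing to a multi-sentential framework, and automatically identifies MCT in multilingual articles, ultimately building MUTANT, a multi-sentential code-mixed Hinglish dataset containing 85k Hinglish MCTs.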

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset