TL;DR: This work aims to build natural language processing models that, through fine-tuning and training on public datasets for Indic languages, outperform existing models on extractive question answering. The two models built on RoBERTa performed best, confirming that for language-specific tasks, the specificity of the training data has a greater influence on model performance.
Abstract
Indic languages like Hindi and Tamil are underrepresented in the natural language processing (NLP) field compared to languages like English. Due to this underrepresentation, performance on NLP tasks (such as sear