Automated offensive language detection is essential in combating the spread of hate speech, particularly on social media. This paper describes our work on Offensive Language Identification in Marathi, a low-resource Indic language. The problem is formulated as a text classification task: identifying a tweet as offensive or non-offensive. We evaluate different monolingual and multilingual BERT models on this task, focusing on BERT models pre-trained on social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation using two existing Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), with an F1 score of 98.43 on the HASOC 2022 test set. This result sets a new state of the art on the HASOC 2022 / MOLD v2 test set.

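This paper presents our work on Offensive Language Identification in Marathi, a low-resource Indic language. It discusses using BERT models for a text classification task that identifies whether a tweet is offensive, and compares different BERT models on the HASOC 2022 test set, including augmentation with the existing Marathi hate speech corpora HASOC 2021 and L3Cube-MahaHate. When fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), MahaTweetBERT achieves an F1 score of 98.43 on the HASOC 2022 test set, which is also a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.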
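The data augmentation and evaluation setup described above can be illustrated with a minimal sketch. It pools several labelled corpora into one training set and scores binary predictions with F1. Several details here are assumptions rather than the authors' procedure: the corpora are represented as hypothetical in-memory lists of `(text, label)` pairs already mapped to binary labels (1 = offensive, 0 = non-offensive), exact duplicate tweets are dropped when pooling, and the F1 score is macro-averaged, which the abstract does not specify. The function names `combine_corpora` and `macro_f1` are illustrative only.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical representation: each example is (tweet_text, label),
# with label 1 = offensive and 0 = non-offensive (assumed binary mapping).
Example = Tuple[str, int]


def combine_corpora(*corpora: Iterable[Example]) -> List[Example]:
    """Pool several labelled corpora into one training set.

    Assumption: exact duplicate texts are dropped, keeping the first
    occurrence. The paper's abstract does not describe deduplication.
    """
    seen = set()
    combined: List[Example] = []
    for corpus in corpora:
        for text, label in corpus:
            key = text.strip()
            if key not in seen:
                seen.add(key)
                combined.append((key, label))
    return combined


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int] = (0, 1)) -> float:
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In this sketch, the pooled list returned by `combine_corpora` would be the fine-tuning set for a classifier such as MahaTweetBERT, and `macro_f1` would be applied to that classifier's predictions on the held-out HASOC 2022 test set.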

A Twitter BERT Approach for Offensive Language Detection in Marathi