Hate speech takes many forms, targets communities with derogatory comments, and sets humanity back in its societal progress. HateXplain is a recently published dataset, and the first to use annotated spans in the form of rationales, along with speech classification categories and targeted communities, to make classification more humanlike, explainable, accurate, and less biased. We fine-tune BERT to perform this task as joint rationale and class prediction, and compare its performance on metrics spanning accuracy, explainability, and bias. Our contribution is threefold. First, we experiment with the amalgamated rationale-class loss under different importance values. Second, we experiment extensively with the ground truth attention values for the rationales: we introduce conservative and lenient attentions, compare the performance of the model on HateXplain, and test our hypothesis. Third, to reduce the unintended bias in our models, we mask the target community words and note the improvement in bias and explainability metrics. Overall, we achieve model explainability, bias reduction, and several incremental improvements over the original BERT implementation.
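
As a minimal sketch of what an amalgamated rationale-class loss with an importance value could look like, the snippet below adds a weighted rationale term to a standard classification term. The function names, the binary cross-entropy form of the rationale term, and the `importance` parameter are illustrative assumptions, not the implementation used in this work.

```python
import math


def class_loss(class_probs, label):
    """Cross-entropy of the true class under the predicted class probabilities."""
    return -math.log(class_probs[label])


def rationale_loss(token_probs, rationale_mask):
    """Mean binary cross-entropy between per-token rationale probabilities
    and the 0/1 human-annotated rationale mask (assumed formulation)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(token_probs, rationale_mask):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(rationale_mask)


def combined_loss(class_probs, label, token_probs, rationale_mask, importance=1.0):
    """Class loss plus the rationale loss scaled by a hypothetical importance value."""
    return class_loss(class_probs, label) + importance * rationale_loss(
        token_probs, rationale_mask
    )


# Toy example: three classes, four tokens, two of them marked as rationale.
class_probs = [0.7, 0.2, 0.1]
token_probs = [0.9, 0.8, 0.3, 0.1]
rationale_mask = [1, 1, 0, 0]
for importance in (0.0, 1.0, 10.0):
    value = combined_loss(class_probs, 0, token_probs, rationale_mask, importance)
    print(importance, round(value, 4))
```

Under these assumptions, an importance of zero reduces the objective to plain classification, while larger values push the model harder toward matching the annotated rationales.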

Exploring Hate Speech Detection with HateXplain and BERT