Empathy plays a pivotal role in fostering prosocial behavior and is often triggered by the sharing of personal experiences through narratives. However, modeling empathy with NLP approaches remains challenging because it is deeply intertwined with the dynamics of human interaction. Previous approaches, which fine-tune language models (LMs) on human-annotated empathic datasets, have had limited success. To improve empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs and supervised fine-tuning with Large Language Models (LLMs). While these methods improve on previous ones, the overall results remain unsatisfactory. To better understand this trend, we performed an analysis that revealed low agreement among annotators. This lack of consensus hinders training and highlights the subjective nature of the task. We also explore the impact of culture on annotations: we collected story pairs in Urdu and found that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. The insights from our systematic exploration of LMs' understanding of empathy suggest that there is considerable room for exploration in both task formulation and modeling.


Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Understanding of Language Models