Human moderation of online conversation is essential to maintaining civility and focus in a dialogue, but is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness through a multidisciplinary lens that incorporates insights from social science. We then propose a comprehensive evaluation framework that uses this definition to assess models' moderation capabilities independently of human intervention. With our framework, we conduct the first known study of conversational dialogue models as moderators, finding that appropriately prompted models can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.

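This paper establishes a systematic definition of conversational moderation effectiveness through a multidisciplinary lens and proposes a comprehensive evaluation framework for assessing models' moderation capabilities without human intervention. The first known study of conversational dialogue models as moderators, conducted with this framework, finds that appropriately prompted models can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.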

Can Language Model Moderators Improve the Health of Online Discourse?