The promise of interaction between intelligent conversational agents and humans is that models can learn from human feedback in order to improve. Unfortunately, such exchanges in the wild will not always involve human utterances that are benign or of high quality, and will include a mixture of engaged users (helpers) and unengaged or even malicious users (trolls). In this work we study how to perform robust learning in such an environment. We introduce a benchmark evaluation, SafetyMix, which can evaluate methods that learn safe vs. toxic language in a variety of adversarial settings to test their robustness. We propose and analyze several mitigating learning algorithms that identify trolls either at the example level or at the user level. Our main finding is that user-based methods, which take into account that troll users will exhibit adversarial behavior across multiple examples, work best in a variety of settings on our benchmark. We then test these methods in a further real-life setting of conversations collected during deployment, with similar results.

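This paper studies how to learn robustly from conversational interaction with humans, where the human utterances come from either harmful users (trolls) or helpful users (helpers). It introduces a benchmark evaluation, SafetyMix, to test the robustness of learning algorithms. The results show that in this setting user-based methods are more effective than example-based methods.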

Learning from Data in the Mixed Adversarial Non-Adversarial Case: Finding the Helpers and Ignoring the Trolls
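
The difference between identifying trolls at the example level and at the user level can be illustrated with a minimal sketch. This is not the paper's algorithm: the per-example `score` field (a hypothetical "suspicion" value, for instance how strongly a label disagrees with a model's prediction), the `threshold` of 0.5, and both function names are assumptions made only for illustration.

```python
from collections import defaultdict
from statistics import mean


def example_level_filter(examples, threshold=0.5):
    """Keep each example whose own (hypothetical) suspicion score is below the threshold."""
    return [ex for ex in examples if ex["score"] < threshold]


def user_level_filter(examples, threshold=0.5):
    """Keep all examples from users whose mean suspicion score is below the threshold."""
    scores_by_user = defaultdict(list)
    for ex in examples:
        scores_by_user[ex["user"]].append(ex["score"])
    # A user is trusted when their behavior, averaged over all their examples, looks benign.
    trusted = {user for user, scores in scores_by_user.items() if mean(scores) < threshold}
    return [ex for ex in examples if ex["user"] in trusted]


# Toy data with made-up scores; user names are illustrative only.
examples = [
    {"user": "helper_1", "text": "a", "score": 0.1},
    {"user": "helper_1", "text": "b", "score": 0.6},  # honest but noisy example
    {"user": "troll_1", "text": "c", "score": 0.9},
    {"user": "troll_1", "text": "d", "score": 0.4},  # slips past a per-example check
    {"user": "troll_1", "text": "e", "score": 0.8},
]

kept_by_example = example_level_filter(examples)
kept_by_user = user_level_filter(examples)
```

In this toy data the example-level filter discards the helper's noisy example and keeps one of the troll's examples, while the user-level filter aggregates evidence across each user's examples and so keeps both helper examples and drops the troll entirely. This mirrors the intuition stated in the abstract that trolls exhibit adversarial behavior across multiple examples.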