A natural way to design a negotiation dialogue system is via self-play RL: train an agent that learns to maximize its performance by interacting with a simulated user that has been designed to imitate human-human dialogue data. Although this procedure has been adopted in prior work, we find that it results in a fundamentally flawed system that fails to learn the value of compromise in a negotiation. This failure often leads to no agreement (i.e., the partner walking away without a deal), ultimately hurting the model's overall performance. We investigate this observation in the context of the DealOrNoDeal task, a multi-issue negotiation over books, hats, and balls. Grounded in negotiation theory from economics, we modify the training procedure in two novel ways to design agents with diverse personalities and analyze their performance with human partners. We find that although both techniques show promise, a selfish agent, which maximizes its own performance while also avoiding walkaways, outperforms the other variants by implicitly learning to generate value for both itself and the negotiation partner. We discuss the implications of our findings for what it means to be a successful negotiation dialogue system and how these systems should be designed in the future.

Be Selfish, But Wisely: Investigating the Impact of Agent Personality in Human-Agent Interactions