Recent advancements in open-domain dialogue systems have been propelled by
the emergence of high-quality large language models (LLMs) and various
effective training methodologies. Nevertheless, the presence of toxicity within
these models presents a significant challenge that can potentially diminish the
user experience. In this study, we introduce an innovative training algorithm,
an improvement upon direct preference optimization (DPO), called adversarial
DPO (ADPO). The ADPO algorithm is designed to train models to assign higher
probability to preferred responses and lower probability to unsafe responses,
which are self-generated using the toxic control token. We
demonstrate that ADPO enhances the model's resilience against harmful
conversations while minimizing performance degradation. Furthermore, we
show that ADPO offers a more stable training procedure than
traditional DPO. To the best of our knowledge, this is the first adaptation of
the DPO algorithm that directly incorporates harmful data into the generative
model, thereby reducing the need to artificially create safe dialogue data.
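
To make the idea of raising the probability of preferred responses while lowering that of unsafe ones concrete, the following is a minimal sketch in plain Python. The function dpo_loss follows the standard per-example DPO objective. The adversarial_penalty term, the weight gamma, and the way the two are summed in adpo_style_loss are hypothetical illustrations and are not the objective defined in this work. All inputs are assumed to be sequence-level log-probabilities under the trained policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard per-example DPO loss.

    logp_w / logp_l: policy log-probs of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def adversarial_penalty(logp_unsafe: float, ref_logp_unsafe: float,
                        beta: float = 0.1) -> float:
    """Hypothetical extra term (an assumption, not the paper's formula).

    It shrinks as the policy assigns the self-generated unsafe response
    a lower log-probability than the reference model does.
    """
    return -log_sigmoid(-beta * (logp_unsafe - ref_logp_unsafe))


def adpo_style_loss(logp_w: float, logp_l: float,
                    ref_logp_w: float, ref_logp_l: float,
                    logp_unsafe: float, ref_logp_unsafe: float,
                    beta: float = 0.1, gamma: float = 1.0) -> float:
    """Illustrative combination: DPO loss plus a weighted unsafe-response penalty."""
    base = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
    penalty = adversarial_penalty(logp_unsafe, ref_logp_unsafe, beta)
    return base + gamma * penalty


if __name__ == "__main__":
    # Toy numbers only: the policy already prefers the chosen response
    # and has moved away from the unsafe one relative to the reference.
    loss = adpo_style_loss(
        logp_w=-10.0, logp_l=-14.0,
        ref_logp_w=-12.0, ref_logp_l=-12.0,
        logp_unsafe=-20.0, ref_logp_unsafe=-15.0,
    )
    print(f"illustrative loss: {loss:.4f}")
```

In this sketch, gamma controls how strongly the unsafe response is pushed down relative to the ordinary preference margin; setting gamma to 0 recovers plain DPO.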

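The novel training algorithm ADPO improves the model's robustness to harmful conversations while minimizing performance degradation, and it is the first to incorporate harmful data directly into the generative model, reducing the need to artificially create safe dialogue data.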