BriefGPT.xyz
May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim, Gary Geunbae Lee
TL;DR
ADPO is a novel training algorithm that improves a model's robustness to harmful dialogue while minimizing performance degradation. It is the first to incorporate harmful data directly into the generative model, reducing the need for manually crafted safe-dialogue data.
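The TL;DR describes ADPO as a preference-optimization method that folds harmful dialogues directly into training. As an illustration only, and not the paper's exact formulation, the sketch below shows a standard DPO-style loss in which toxic responses are treated as the rejected side of each preference pair; the function name, the precomputed sequence log-probabilities, and the pairing of safe (chosen) versus toxic (rejected) responses are all assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO preference loss over sequence log-probabilities.

    Illustrative assumption: in an ADPO-like setup, the "rejected"
    responses are the harmful/toxic dialogues, so the policy is pushed
    away from toxic continuations while the KL-like anchoring to the
    reference model helps preserve coherence.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between safe (chosen) and toxic (rejected) responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two dialogues.
if __name__ == "__main__":
    pc = torch.tensor([-12.3, -10.1])   # policy log p(safe response | context)
    pr = torch.tensor([-11.0, -9.5])    # policy log p(toxic response | context)
    rc = torch.tensor([-12.0, -10.0])   # reference log p(safe response | context)
    rr = torch.tensor([-10.5, -9.0])    # reference log p(toxic response | context)
    print(dpo_style_loss(pc, pr, rc, rr).item())
```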
Abstract
Recent advancements in open-domain dialogue systems have been propelled by the emergence of high-quality large language models (LLMs) and various effective training methodologies. Nevertheless, the presence of …