The rapid advancement of large language models (LLMs) presents both
opportunities and challenges, particularly the unintentional generation of
harmful and toxic responses. Traditional alignment methods strive to steer
LLMs toward desired performance and shield them from malicious content. This
study instead proposes a novel alignment strategy rooted in mistake analysis:
LLMs are purposefully exposed to flawed outputs, and those outputs are then
assessed thoroughly through natural language analysis so that the model
comprehends the internal reasons they are flawed. Toxic responses can thus be
transformed into an instruction tuning corpus for model alignment, and LLMs
are not only deterred from generating flawed responses but also trained to
self-criticize, leveraging their innate ability to discriminate toxic content.
Experimental results demonstrate that the proposed method outperforms
conventional alignment techniques on safety instruction following while
maintaining superior efficiency.

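By exposing large language models to flawed outputs and assessing them thoroughly, this study proposes a novel alignment strategy based on mistake analysis that aims to fully understand the internal reasons for those flaws and turns harmful responses into an instruction tuning corpus for model alignment. LLMs are thereby not only deterred from generating flawed responses but also trained to self-criticize, drawing on their innate ability to discriminate toxic content. Experimental results show that the method outperforms conventional alignment techniques on safety instruction following while maintaining superior efficiency.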
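
As a rough illustration of how a flawed response and its natural language analysis might be packed into an instruction tuning sample, the sketch below may help. Everything in it is an assumption made for illustration: the `MistakeRecord` fields, the `ANALYSIS_TEMPLATE` wording, and the `prompt`/`completion` sample layout are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass

# Hypothetical prompt wording (an assumption, not the study's actual template).
ANALYSIS_TEMPLATE = (
    "Below is an instruction and a response that may be harmful.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Explain why this response is harmful or flawed."
)


@dataclass
class MistakeRecord:
    """One flawed output together with a written analysis of its flaw."""
    instruction: str       # the original user instruction
    flawed_response: str   # the harmful or flawed model output
    analysis: str          # natural language explanation of the flaw


def to_tuning_sample(record: MistakeRecord) -> dict:
    """Turn a mistake record into a prompt/completion pair.

    The model is shown the instruction and the flawed response, and the
    training target is the analysis, so it learns to critique the output
    rather than to reproduce it.
    """
    prompt = ANALYSIS_TEMPLATE.format(
        instruction=record.instruction,
        response=record.flawed_response,
    )
    return {"prompt": prompt, "completion": record.analysis}


if __name__ == "__main__":
    record = MistakeRecord(
        instruction="<user instruction>",
        flawed_response="<flawed model output>",
        analysis="<explanation of why the output is flawed>",
    )
    sample = to_tuning_sample(record)
    print(sample["prompt"])
    print(sample["completion"])
```

In this layout the flawed response appears only in the prompt, never in the training target, which is one plausible way to expose a model to mistakes without rewarding it for generating them.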