BriefGPT.xyz
Feb 2023
Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
Max Lamparth, Anka Reuel
TL;DR
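This paper introduces PCP ablation, a new interpretability tool that replaces modules in the MLP and attention layers, reducing the model's parameters and behavior. The authors use it to dissect the internal mechanisms by which transformer language models process changes in sentiment, and the results offer guidance for engineered replacements that remove, insert, or modify backdoor mechanisms.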
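The summary above does not spell out how a module is replaced. As a rough illustration only, the sketch below assumes the replacement is a low-rank projection built from the top principal components of a module's recorded activations. The function names, the rank `k`, and the random stand-in activations are all hypothetical, and this is not presented as the authors' implementation.

```python
import numpy as np


def top_principal_components(acts: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k principal directions of acts, shape (k, d).

    acts is assumed to hold one activation vector per row, shape (n, d).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def low_rank_projection(acts: np.ndarray, k: int) -> np.ndarray:
    """Build a rank-k projection matrix (d, d) onto the top-k directions.

    Assumption: this matrix stands in for the module being ablated.
    """
    pcs = top_principal_components(acts, k)
    return pcs.T @ pcs


# Hypothetical stand-in for activations recorded from one module.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))

projection = low_rank_projection(acts, k=3)
# Keep only the component of each activation inside the retained subspace.
reduced_acts = acts @ projection
```

With a small `k`, the projection keeps only the directions along which the module's activations vary most, which is one way to read "reducing the model's parameters and behavior" in the summary.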
Abstract
Recent advancements in interpretability research made transformer language models more transparent. This progress led to a better understanding of their inner workings for toy and naturally occurring models. Howe