Furui Cheng, Vilém Zouhar, Robin Shing Moon Chan, Daniel Fürst, Hendrik Strobelt...
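TL;DR: A new algorithm for generating complete and meaningful textual counterfactual examples, together with an interactive visualization tool, for analyzing and explaining LLMs.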
Abstract
Counterfactual examples are useful for exploring the decision boundaries of
machine learning models and determining feature attributions. How can we apply
counterfactual-based methods to analyze and explain LLMs?