Humans rarely learn one fact in isolation. Instead, learning a new fact
induces knowledge of other facts about the world. For example, in learning that
a korat is a type of cat, you also infer that it is a mammal and has claws, ensuring
your model of the world is consistent. Knowledge editing aims to inject new
facts into language models to improve their factuality, but current benchmarks
fail to evaluate consistency, which is critical to ensure efficient, accurate,
and generalizable edits. We manually create TAXI, a new benchmark dataset
designed specifically to evaluate consistency. TAXI contains 11,120
multiple-choice queries for 976 edits spanning 41 categories (e.g., Dogs), 164
subjects (e.g., Labrador), and 183 properties (e.g., is a mammal). We then use
TAXI to evaluate popular editors' consistency, measuring how often editing a
subject's category appropriately edits its properties. We find that 1) the
editors achieve marginal yet non-random consistency, 2) their consistency far
underperforms human baselines, and 3) consistency is more achievable when
editing atypical subjects. Our code and data are available at
this https URL
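
As a rough illustration of the consistency measure described above (how often
editing a subject's category appropriately edits its properties), the following
is a minimal sketch, not the released TAXI code. The names `Query`,
`consistency_rate`, and `toy_edited_model`, as well as the example questions
and answer choices, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Query:
    """One multiple-choice query about a subject whose category was edited.

    All fields are illustrative assumptions, not the TAXI data schema.
    """
    subject: str
    question: str
    choices: Tuple[str, ...]
    expected: str  # the choice implied by the subject's NEW category


def consistency_rate(
    queries: Sequence[Query],
    answer_fn: Callable[[Query], str],
) -> float:
    """Fraction of queries where the edited model picks the expected choice."""
    if not queries:
        raise ValueError("need at least one query")
    hits = sum(1 for q in queries if answer_fn(q) == q.expected)
    return hits / len(queries)


# Hypothetical edit: "a Labrador is a type of cat".
queries = [
    Query("Labrador", "What sound does a Labrador make?",
          ("bark", "meow"), expected="meow"),
    Query("Labrador", "What does a Labrador like to chase?",
          ("sticks", "mice"), expected="mice"),
]


def toy_edited_model(q: Query) -> str:
    """Stand-in for an edited language model: only one property was updated."""
    updated = {"What sound does a Labrador make?": "meow"}
    # Fall back to the first choice (the pre-edit answer) when not updated.
    return updated.get(q.question, q.choices[0])


rate = consistency_rate(queries, toy_edited_model)
print(f"consistency: {rate:.2f}")
```

In this toy setup the stand-in model updates one of the two properties, so it
is only partially consistent with the edit; a real evaluation would replace
`toy_edited_model` with calls to an edited language model.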

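Injecting knowledge into language models through editing places high demands on
semantic consistency, yet existing benchmark datasets cannot adequately evaluate
it. This paper creates the TAXI benchmark dataset and uses it to evaluate the
consistency of popular editors, finding that their consistency is markedly lower
than the human baseline and that consistency is easier to achieve when editing
atypical subjects.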