Large language models (LLMs) like ChatGPT have gained increasing prominence in artificial intelligence, making a profound impact on society and various industries like business and science. However, the presence of false information on the internet and in text corpus poses a significant risk to the reliability and safety of LLMs, underscoring the urgent need to understand the mechanisms of how false information impacts and spreads in LLMs. In this paper, we investigate how false information spreads in LLMs and affects related responses by conducting a series of experiments on the effects of source authority, injection paradigm, and information relevance. Specifically, we compare four authority levels of information sources (Twitter, web blogs, news reports, and research papers), two common knowledge injection paradigms (in-context injection and learning-based injection), and three degrees of information relevance (direct, indirect, and peripheral). The experimental results show that (1) False information will spread and contaminate related memories in LLMs via a semantic diffusion process, i.e., false information has global detrimental effects beyond its direct impact. (2) Current LLMs are susceptible to authority bias, i.e., LLMs are more likely to follow false information presented in a trustworthy style like news or research papers, which usually causes deeper and wider pollution of information. (3) Current LLMs are more sensitive to false information through in-context injection than through learning-based injection, which severely challenges the reliability and safety of LLMs even if all training data are trusty and correct. The above findings raise the need for new false information defense algorithms to address the global impact of false information, and new alignment algorithms to unbiasedly lead LLMs to follow internal human values rather than superficial patterns.

本研究探究了虚假信息在大语言模型中的传播机制及其对模型响应的影响，结果表明：虚假信息会通过语义扩散传播并污染相关记忆；大语言模型更容易受到权威偏见的影响；在上下文注入下，大语言模型对虚假信息更敏感。这些结果表明有必要研究新的抵御虚假信息的算法以应对其全局影响，并研究新的对齐算法以使大语言模型遵循内在的人类价值观而非表面模式。

一滴墨汁或可引发百万思考：大型语言模型中虚假信息扩散