Nov, 2024
Can adversarial attacks by large language models be attributed?
Manuel Cebrian, Jan Arne Telle
TL;DR
This paper examines the problem of attributing outputs of large language models (LLMs) in adversarial settings. Viewed through the lens of formal language theory, the study finds that, because certain language classes are non-identifiable and the outputs of fine-tuned models overlap, a finite sample of text cannot be deterministically attributed to a specific LLM. This finding underscores the need for proactive measures to mitigate the risks posed by adversarial LLM use.
Abstract
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem through the lens of formal language theory.
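
The impossibility result summarized in the TL;DR rests on a simple observation: when the output languages of two candidate models overlap, any finite text sample drawn from the overlap is consistent with both models, so no deterministic attribution is possible from that sample alone. Below is a minimal toy sketch of this argument in Python, using two hypothetical output languages; it is an illustration of the overlap idea, not the paper's formal construction.

```python
# Toy illustration: two hypothetical "model output languages" that overlap.
# A finite sample from the intersection cannot be attributed to either model.

def model_a_outputs(n):
    """Hypothetical language of model A: strings of 'a' up to length n."""
    return {"a" * k for k in range(1, n + 1)}

def model_b_outputs(n):
    """Hypothetical language of model B: even-length strings of 'a' up to n, plus 'b'."""
    return {"a" * k for k in range(2, n + 1, 2)} | {"b"}

def consistent_models(sample, candidates):
    """Return every candidate whose output language contains the whole sample."""
    return [name for name, lang in candidates.items() if sample <= lang]

candidates = {"A": model_a_outputs(10), "B": model_b_outputs(10)}

# A finite sample drawn from the overlap of A and B:
sample = {"aa", "aaaa"}
print(consistent_models(sample, candidates))  # ['A', 'B'] -- attribution is ambiguous
```

However large the sample, as long as it lies in the intersection of the two languages, both models remain consistent explanations, which is the intuition behind the non-identifiability results the paper develops formally.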