Nov, 2024
Can adversarial attacks by large language models be attributed?
Manuel Cebrian, Jan Arne Telle
TL;DR
This paper examines the problem of attributing outputs of large language models (LLMs) in adversarial settings. Viewed through the lens of formal language theory, the study finds that, because certain language classes are non-identifiable and the outputs of fine-tuned models overlap, a finite sample of text cannot be deterministically attributed to a specific LLM. This finding underscores the need for proactive measures to mitigate the risks posed by adversarial LLM use.
Abstract
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem through the lens of formal language theory.
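
The impossibility result summarized in the TL;DR rests on a simple observation: when the output languages of two candidate models overlap, any finite text sample drawn from the overlap is consistent with both models, so no deterministic attribution is possible from that sample alone. Below is a minimal toy sketch of this argument in Python, using two hypothetical output languages; it is an illustration of the overlap idea, not the paper's formal construction.

```python
# Toy illustration: two hypothetical "model output languages" that overlap.
# A finite sample from the intersection cannot be attributed to either model.

def model_a_outputs(n):
    """Hypothetical language of model A: strings of 'a' up to length n."""
    return {"a" * k for k in range(1, n + 1)}

def model_b_outputs(n):
    """Hypothetical language of model B: even-length strings of 'a' up to n, plus 'b'."""
    return {"a" * k for k in range(2, n + 1, 2)} | {"b"}

def consistent_models(sample, candidates):
    """Return every candidate whose output language contains the whole sample."""
    return [name for name, lang in candidates.items() if sample <= lang]

candidates = {"A": model_a_outputs(10), "B": model_b_outputs(10)}

# A finite sample drawn from the overlap of A and B:
sample = {"aa", "aaaa"}
print(consistent_models(sample, candidates))  # ['A', 'B'] -- attribution is ambiguous
```

However large the sample, as long as it lies in the intersection of the two languages, both models remain consistent explanations, which is the intuition behind the non-identifiability results the paper develops formally.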