Large Language Models (LLMs) have shown exceptional results on current
benchmarks when working individually. The advancement in their capabilities,
along with a reduction in parameter size and inference times, has facilitated
the use of these models as agents, enabling interactions among multiple models
to execute complex tasks. Such collaborations offer several advantages,
including the use of specialized models (e.g., coding), improved confidence
through multiple computations, and enhanced divergent thinking, leading to more
diverse outputs. Thus, the collaborative use of language models is expected to
grow significantly in the coming years. In this work, we evaluate the behavior
of a network of models collaborating through debate under the influence of an
adversary. We introduce pertinent metrics to assess the adversary's
effectiveness, focusing on system accuracy and model agreement. Our findings
highlight the importance of a model's persuasive ability in influencing others.
Additionally, we explore inference-time methods to generate more compelling
arguments and evaluate the potential of prompt-based mitigation as a defensive
strategy.
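
The two quantities the abstract names, system accuracy and model agreement, can be illustrated with a minimal sketch. The definitions here are assumptions made for illustration, not the paper's exact metrics: system accuracy is taken as the fraction of questions whose majority-vote answer matches the gold label, and agreement as the fraction of agent pairs that give the same answer. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_answer(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def system_accuracy(debates, gold):
    """Fraction of questions whose majority-vote answer equals the gold label.

    Assumed definition: `debates` holds one list of final agent answers per question.
    """
    correct = sum(majority_answer(a) == g for a, g in zip(debates, gold))
    return correct / len(gold)


def agreement(answers):
    """Fraction of agent pairs that gave the same answer on one question (assumed definition)."""
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Two questions, three agents each; an adversary would aim to lower accuracy
# while pulling the other agents toward its own answer.
debates = [["A", "A", "B"], ["B", "C", "C"]]
gold = ["A", "B"]
accuracy = system_accuracy(debates, gold)
first_question_agreement = agreement(debates[0])
```

Under these assumed definitions, a successful adversary shows up as a drop in `system_accuracy` together with high agreement on a wrong answer.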

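This work evaluates how a network of models collaborating through debate behaves under adversarial influence, explores inference-time methods for generating more convincing arguments, and assesses the potential of prompt-based mitigation as a defensive strategy.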

MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

Large language models (LLMs) can generate long-form and coherent text, but
they still frequently hallucinate facts, thus limiting their reliability. To
address this issue, inference-time methods have been proposed that elicit
truthful responses by shifting LLM representations towards learned "truthful
directions". However, applying the truthful directions with the same intensity
fails to generalize across different question contexts. We propose LITO, a
Learnable Intervention method for Truthfulness Optimization that automatically
identifies the optimal intervention intensity tailored to a specific context.
LITO explores a sequence of model generations produced at increasing
intervention intensities. It selects the most accurate response or refuses to
answer when the predictions are highly uncertain. Experiments on multiple LLMs
and question-answering datasets demonstrate that LITO improves truthfulness
while preserving task accuracy. The adaptive nature of LITO counters issues
with one-size-fits-all intervention-based solutions, maximizing model
truthfulness by reflecting internal knowledge only when the model is confident.
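
The select-or-refuse step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not LITO's implementation: the `confidence` field is a hypothetical stand-in for whatever learned accuracy estimate the method produces, and the fixed `threshold` is an assumed refusal rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    intensity: float   # intervention strength used for this generation
    response: str      # text generated at that strength
    confidence: float  # hypothetical estimate that the response is accurate


def select_response(candidates: List[Candidate], threshold: float = 0.5) -> Optional[Candidate]:
    """Pick the candidate with the highest estimated accuracy, or return None to refuse.

    The threshold is an assumption for illustration; the abstract only states
    that the method refuses when predictions are highly uncertain.
    """
    best = max(candidates, key=lambda c: c.confidence)
    return best if best.confidence >= threshold else None


# Generations at increasing intervention intensities (values are made up).
candidates = [
    Candidate(intensity=0.0, response="answer at no intervention", confidence=0.4),
    Candidate(intensity=5.0, response="answer at medium intervention", confidence=0.8),
    Candidate(intensity=10.0, response="answer at strong intervention", confidence=0.3),
]
chosen = select_response(candidates)
refused = select_response([Candidate(0.0, "unsure", 0.2), Candidate(5.0, "unsure", 0.1)])
```

Returning `None` models the refusal case, so a caller can substitute a fixed "I have no comment"-style reply of its own choosing.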

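LITO is a learnable intervention method that improves truthfulness by identifying the optimal intervention intensity for a given context; it selects the most accurate response, or refuses to answer when uncertainty is high.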