Rapid advances in Large Language Models (LLMs) are driving an
increasing number of applications. Along with the growing number of users,
we also see a growing number of attackers who try to outsmart these
systems. They want the model to reveal confidential information, produce
specific false information, or exhibit offensive behavior. To this end, they manipulate their
instructions for the LLM by inserting separators or rephrasing them
systematically until they reach their goal. Our approach is different. It
inserts words from the model vocabulary. We find these words using an
optimization procedure and embeddings from another LLM (the attacker LLM). We
demonstrate our approach by goal hijacking two popular open-source LLMs, one
from the Llama2 family and one from the Flan-T5 family. We present two main
findings. First, our approach creates inconspicuous instructions and is
therefore hard to detect. For many attack cases, we find that even a single
word insertion is sufficient. Second, we demonstrate that we can conduct our
attack using a model different from the target model.

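By inserting words from the model vocabulary, selected with an optimization procedure and embeddings from an attacker model, we show that our approach can successfully hijack two popular open-source LLMs, Llama2 and Flan-T5. The resulting instructions are inconspicuous, a single inserted word can suffice, and the attack can be carried out with a model different from the target model.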

Vocabulary Attack to Hijack Large Language Model Applications

In the cybersecurity setting, defenders are often at the mercy of their
detection technologies and subject to the information and experiences that
individual analysts have. In order to give defenders an advantage, it is
important to understand an attacker's motivation and their likely next best
action. As a first step in modeling this behavior, we introduce a security game
framework that simulates interplay between attackers and defenders in a noisy
environment, focusing on the factors that drive decision making for attackers
and defenders in the variants of the game with full knowledge and
observability, knowledge of the parameters but no observability of the state
(``partial knowledge''), and zero knowledge or observability (``zero
knowledge''). We demonstrate the importance of making the right assumptions
about attackers, since outcomes differ significantly depending on them.
Furthermore, there is a measurable trade-off between false positives and true
positives in terms of attacker outcomes, suggesting that a more
false-positive-prone environment may be acceptable under conditions where true
positives are also higher.
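
The trade-off in the last sentence can be made concrete with a toy calculation. The sketch below is not the game framework from the abstract. It assumes a hypothetical detector that flags each attacker action independently with probability `tpr` and each benign event with probability `fpr`. The function names and the two detector settings are invented for illustration.

```python
def attacker_success_probability(tpr, steps):
    """Chance that an attacker finishes `steps` actions without a single
    true-positive alert, assuming independent detection per action."""
    return (1.0 - tpr) ** steps


def expected_false_alerts(fpr, benign_events):
    """Expected number of alerts raised on benign events."""
    return fpr * benign_events


# Two hypothetical detector settings (illustrative numbers only).
detectors = {
    "quiet": {"tpr": 0.5, "fpr": 0.01},
    "noisy": {"tpr": 0.9, "fpr": 0.10},
}

for name, d in detectors.items():
    success = attacker_success_probability(d["tpr"], steps=3)
    false_alerts = expected_false_alerts(d["fpr"], benign_events=1000)
    print(f"{name}: attacker success={success:.4f}, false alerts={false_alerts:.0f}")
```

Under these assumptions the noisier detector produces more false alerts but leaves the attacker a much smaller chance of finishing undetected, which is the kind of exchange the abstract describes.
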

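To give defenders a tactical advantage, this paper introduces a security game framework that simulates how attackers and defenders make decisions under different levels of knowledge and in different scenarios, and examines how to balance true and false alarms.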

Simulation of Attacker Defender Interaction in a Noisy Security Game

We introduce and study a novel majority-based opinion diffusion model.
Consider a graph $G$, which represents a social network. Assume that initially
a subset of nodes, called seed nodes or early adopters, are colored either
black or white, corresponding to a positive or negative opinion regarding a
consumer product or a technological innovation. Then, in each round an
uncolored node, which is adjacent to at least one colored node, chooses the
most frequent color among its neighbors.
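
As a rough illustration of the update rule just described, the sketch below simulates the process in Python. Several details are assumptions not stated in the abstract: updates are applied synchronously, a node keeps its color once chosen, and ties are broken toward a configurable `tie_color`. The names `diffuse`, `adj`, and `seeds` are made up for this example.

```python
from collections import Counter


def diffuse(adj, seeds, tie_color="white"):
    """Run majority-based diffusion until no uncolored node can be colored.

    adj:   dict mapping each node to a list of its neighbors
    seeds: dict mapping seed nodes to "black" or "white"
    Returns (final coloring, number of rounds). The tie rule is an assumption.
    """
    color = dict(seeds)
    rounds = 0
    while True:
        updates = {}
        for v in adj:
            if v in color:
                continue  # assumed: a colored node never changes
            counts = Counter(color[u] for u in adj[v] if u in color)
            if not counts:
                continue  # no colored neighbor yet, so v waits
            black, white = counts["black"], counts["white"]
            if black > white:
                updates[v] = "black"
            elif white > black:
                updates[v] = "white"
            else:
                updates[v] = tie_color
        if not updates:
            return color, rounds
        color.update(updates)  # synchronous: apply all choices at once
        rounds += 1


# Path 0-1-2-3-4 with a black seed at one end and a white seed at the other.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
final, rounds = diffuse(path, {0: "black", 4: "white"})
print(final, rounds)
```

On this five-node path the process stops after two rounds, and the middle node is decided by the tie rule. The returned round count corresponds to the stabilization time of this simplified variant.
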
Consider a marketing campaign that advertises a product of poor quality and
whose ultimate goal is for more than half of the population to believe in the
quality of the product at the end of the opinion diffusion process. We focus on
three types of attackers, which can select the seed nodes in a deterministic or
random fashion and manipulate almost half of them to adopt a positive opinion
toward the product (that is, to choose the color black). We say that an attacker
succeeds if a majority of nodes are black at the end of the process. Our main
purpose is to characterize classes of graphs where an attacker cannot succeed.
In particular, we prove that if the maximum degree of the underlying graph is
not too large or if it has strong expansion properties, then it is fairly
resilient to such attacks.
Furthermore, we prove tight bounds on the stabilization time of the process
(that is, the number of rounds it needs to end) in both settings of choosing
the seed nodes deterministically and randomly. We also provide several hardness
results for some optimization problems regarding stabilization time and choice
of seed nodes.

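This paper studies a novel majority-based opinion diffusion model and examines attack and defense questions around majority opinion in social networks, in the setting of marketing campaigns for consumer products and technological innovations.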

Majority Opinion Diffusion in Social Networks: An Adversarial Approach

Advances in machine learning have led to broad deployment of systems with
impressive performance on important problems. Nonetheless, these systems can be
induced to make errors on data that are surprisingly similar to examples the
learned system handles correctly. The existence of these errors raises a
variety of questions about out-of-sample generalization and whether bad actors
might use such examples to abuse deployed systems. As a result of these
security concerns, there has been a flurry of recent papers proposing
algorithms to defend against such malicious perturbations of correctly handled
examples. It is unclear how such misclassifications represent a different kind
of security problem than other errors, or even other attacker-produced examples
that have no specific relationship to an uncorrupted input. In this paper, we
argue that adversarial example defense papers have, to date, mostly considered
abstract, toy games that do not relate to any specific security concern.
Furthermore, defense papers have not yet precisely described all the abilities
and limitations of attackers that would be relevant in practical security.
Towards this end, we establish a taxonomy of motivations, constraints, and
abilities for more plausible adversaries. Finally, we provide a series of
recommendations outlining a path forward for future work to more clearly
articulate the threat model and perform more meaningful evaluation.

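This paper shows how establishing more realistic and reliable threat models can better protect the security of machine learning in real-world applications.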