We evaluate whether LLMs learn to make human-like preference judgements in
strategic scenarios by comparing their choices with known empirical results.
We show that Solar and Mistral exhibit stable value-based preferences
consistent with human behavior in the prisoner's dilemma, including the
stake-size effect, and in the traveler's dilemma, including the penalty-size
effect. We establish a relationship between model size, value-based
preference, and superficiality. Finally, we find that models that
tend to be less brittle were trained with sliding window attention.
Additionally, we contribute a novel method for constructing preference
relations from arbitrary LLMs and support for a hypothesis regarding human
behavior in the traveler's dilemma.
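
For readers unfamiliar with the game, the payoff rule of the traveler's dilemma can be sketched as follows. This is a minimal illustration of the standard textbook formulation, not the paper's experimental setup: the function name `traveler_payoffs` and the example claim values are assumptions chosen for illustration. Both players receive the lower claim, and when the claims differ the lower claimant gains the penalty amount while the higher claimant loses it.

```python
def traveler_payoffs(claim_a, claim_b, penalty):
    """Return (payoff_a, payoff_b) for one round of the traveler's dilemma.

    Hypothetical helper: the paper's prompts and claim ranges are not
    given in the abstract, so all values used with it are illustrative.
    """
    low = min(claim_a, claim_b)
    if claim_a == claim_b:
        # Equal claims: both players are simply paid the claim.
        return low, low
    if claim_a < claim_b:
        # A undercut B: A is rewarded, B is penalized.
        return low + penalty, low - penalty
    # B undercut A: the roles are reversed.
    return low - penalty, low + penalty


if __name__ == "__main__":
    for penalty in (2, 50):
        print(penalty, traveler_payoffs(99, 100, penalty))
```

With a small penalty, undercutting a claim of 100 by one unit changes the undercutter's payoff only slightly; with a large penalty the incentive to undercut is much stronger, which is the intuition behind a penalty-size effect.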

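We evaluate whether LLMs learn to make human-like preference judgements in strategic scenarios. The results show that Solar and Mistral exhibit stable value-based preferences consistent with human behavior, including the stake-size effect in the prisoner's dilemma and the penalty-size effect in the traveler's dilemma. We find a relationship between model size, value-based preference, and superficiality, and we find that models trained with sliding window attention are more robust. In addition, we propose a novel method for constructing preference relations from arbitrary LLMs and support a hypothesis about human behavior in the traveler's dilemma.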

Do Large Language Models Learn Human-Like Strategic Preferences?

Neighborhood attention reduces the cost of self attention by restricting each
token's attention span to its nearest neighbors. This restriction,
parameterized by a window size and dilation factor, draws a spectrum of
possible attention patterns between linear projection and self attention.
Neighborhood attention, and more generally sliding window attention patterns,
have long been bounded by infrastructure, particularly in higher-rank spaces
(2-D and 3-D), calling for the development of custom kernels, which have been
limited in either functionality, or performance, if not both. In this work, we
first show that neighborhood attention can be represented as a batched GEMM
problem, similar to standard attention, and implement it for 1-D and 2-D
neighborhood attention. These kernels on average provide 895% and 272%
improvement in full precision latency compared to existing naive kernels for
1-D and 2-D neighborhood attention respectively. We find certain inherent
inefficiencies in all unfused neighborhood attention kernels that bound their
performance and lower-precision scalability. We also develop fused
neighborhood attention, an adaptation of fused dot-product attention kernels
that allows fine-grained control over attention across different spatial axes.
Known for reducing the quadratic time complexity of self attention to a linear
complexity, neighborhood attention can now enjoy a reduced and constant memory
footprint, and record-breaking half precision latency. We observe that our
fused kernels successfully circumvent some of the unavoidable inefficiencies in
unfused implementations. While our unfused GEMM-based kernels only improve half
precision performance compared to naive kernels by an average of 496% and 113%
in 1-D and 2-D problems respectively, our fused kernels improve naive kernels
by an average of 1607% and 581% in 1-D and 2-D problems respectively.
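
To make the window-size and dilation parameters concrete, here is a naive pure-Python sketch of 1-D neighborhood attention. It is a reference illustration only, not the GEMM-based or fused kernels described above. The function names are hypothetical, and the edge handling (shifting the window back inside the sequence so every token still sees exactly `window` neighbors) is an assumption about the usual convention rather than something stated in the abstract.

```python
import math


def neighborhood_indices(i, n, window, dilation=1):
    """Indices that token i attends to in a length-n 1-D sequence."""
    r = i % dilation  # tokens only see others in the same dilation group
    group_len = (n - r + dilation - 1) // dilation
    if not 1 <= window <= group_len:
        raise ValueError("window must fit inside each dilation group")
    p = i // dilation  # position of token i within its group
    # Centre the window on p, then shift it back inside the group at edges
    # (assumed convention; zero-padding is the alternative).
    start = min(max(p - window // 2, 0), group_len - window)
    return [r + (start + k) * dilation for k in range(window)]


def neighborhood_attention(q, k, v, window, dilation=1):
    """Naive 1-D neighborhood attention over lists of float vectors."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        idx = neighborhood_indices(i, n, window, dilation)
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in idx
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [
                sum(w * v[j][c] for w, j in zip(weights, idx)) / total
                for c in range(len(v[0]))
            ]
        )
    return out


if __name__ == "__main__":
    print(neighborhood_indices(4, 8, 3))              # [3, 4, 5]
    print(neighborhood_indices(4, 8, 3, dilation=2))  # [2, 4, 6]
```

The two ends of the spectrum fall out directly: with `window=1` each token attends only to itself, so the output equals `v`, and with `window=n` every token attends to the whole sequence, as in ordinary self attention. Each token does work proportional to `window` rather than `n`, which is where the linear complexity comes from.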

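Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. We represent neighborhood attention as a batched GEMM problem and implement it for 1-D and 2-D neighborhood attention, providing on average 895% and 272% improvement in full precision latency compared to existing naive kernels. We observe that our fused kernels successfully circumvent the unavoidable inefficiencies in unfused implementations.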

Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

We introduce Mistral 7B v0.1, a 7-billion-parameter language model engineered
for superior performance and efficiency. Mistral 7B outperforms Llama 2 13B
across all evaluated benchmarks, and Llama 1 34B in reasoning, mathematics, and
code generation. Our model leverages grouped-query attention (GQA) for faster
inference, coupled with sliding window attention (SWA) to effectively handle
sequences of arbitrary length with a reduced inference cost. We also provide a
model fine-tuned to follow instructions, Mistral 7B -- Instruct, that surpasses
the Llama 2 13B -- Chat model on both human and automated benchmarks. Our
models are released under the Apache 2.0 license.
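
As a rough illustration of why sliding window attention reduces inference cost, the sketch below builds a causal sliding-window mask in plain Python. It is not Mistral's implementation: the function names are hypothetical, and the convention that the window counts the current token (so each query sees at most `window` keys) is an assumption.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal: j <= i. Windowed: j > i - window, so at most `window` keys
    (including position i itself) are visible to each query.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(n, window):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in sliding_window_mask(n, window))


if __name__ == "__main__":
    for row in sliding_window_mask(4, 2):
        print(["x" if allowed else "." for allowed in row])
    print(attended_pairs(4, 2), "pairs with window 2")
    print(attended_pairs(4, 4), "pairs with full causal attention")
```

For a sequence of length 4, a window of 2 leaves 7 attended pairs compared with 10 under full causal attention. The saving grows with sequence length, since the per-token cost is bounded by the window size instead of by the position in the sequence.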

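Mistral 7B v0.1 is a 7-billion-parameter language model that improves inference efficiency by using grouped-query attention (GQA) and sliding window attention (SWA). A fine-tuned model, Mistral 7B -- Instruct, is also provided and surpasses the Llama 2 13B -- Chat model on both human and automated benchmarks.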