We examine how large language models (LLMs) generalize from abstract
declarative statements in their training data. As an illustration, consider an
LLM that is prompted to generate weather reports for London in 2050. One
possibility is that the temperatures in the reports match the mean and variance
of reports from 2023 (i.e. matching the statistics of pretraining). Another
possibility is that the reports predict higher temperatures, by incorporating
declarative statements about climate change from scientific papers written in
2023. An example of such a declarative statement is "global temperatures will
increase by $1^{\circ} \mathrm{C}$ by 2050".
To test the influence of abstract declarative statements, we construct tasks
in which LLMs are finetuned on both declarative and procedural information. We
find that declarative statements influence model predictions, even when they
conflict with procedural information. In particular, finetuning on a
declarative statement $S$ increases the model likelihood for logical
consequences of $S$. The effect of declarative statements is consistent across
three domains: aligning an AI assistant, predicting weather, and predicting
demographic features. Through a series of ablations, we show that the effect of
declarative statements cannot be explained by associative learning based on
matching keywords. Nevertheless, the effect of declarative statements on model
likelihoods is small in absolute terms and increases surprisingly little with
model size (i.e. from 330 million to 175 billion parameters). We argue that
these results have implications for AI risk (in relation to the "treacherous
turn") and for fairness.

通过测试影响大型语言模型预测的抽象声明，我们发现即使它们与程序性信息冲突，抽象声明仍然会影响模型的预测结果。这些结果在多个领域中都是一致的，并且与模型规模的增大关系不大。我们认为这些结果对 AI 风险 (与 “叛变点” 相关) 和公平性具有重要的意义。

陈述性事实对 LLMs 推理能力的影响

Tell, don't show: Declarative facts influence how LLMs generalize

Artificial Intelligence (AI) is one of the most transformative technologies
of the 21st century. The extent and scope of future AI capabilities remain a
key uncertainty, with widespread disagreement on timelines and potential
impacts. As nations and technology companies race toward greater complexity and
autonomy in AI systems, there are concerns over the extent of integration and
oversight of opaque AI decision processes. This is especially true in the
subfield of machine learning (ML), where systems learn to optimize objectives
without human assistance. Objectives can be imperfectly specified or executed
in an unexpected or potentially harmful way. This becomes more concerning as
systems increase in power and autonomy, where an abrupt capability jump could
result in unexpected shifts in power dynamics or even catastrophic failures.
This study presents a hierarchical complex systems framework to model AI risk
and provide a template for alternative futures analysis. Survey data were
collected from domain experts in the public and private sectors to classify AI
impact and likelihood. The results show increased uncertainty over the powerful
AI agent scenario, confidence in multiagent environments, and increased concern
over AI alignment failures and influence-seeking behavior.

本文使用分层复杂系统框架对人工智能（AI）风险进行建模，并从公共和私营领域的领域专家收集调查数据以分类 AI 影响和可能性，结果显示强大的 AI 代理情景有更多不确定性，对 AI 对齐失败和影响寻求行为的关注增加以及对多智能体环境的信心增强。