In this paper, we investigate the degree to which fine-tuning in Large
Language Models (LLMs) effectively mitigates versus merely conceals undesirable
behavior. Through the lens of semi-realistic role-playing exercises designed to
elicit such behaviors, we explore the response dynamics of LLMs post
fine-tuning interventions. Our methodology involves prompting models for
Chain-of-Thought (CoT) reasoning and analyzing the coherence between the
reasoning traces and the resultant outputs. Notably, we identify a pervasive
phenomenon we term \emph{reason-based deception}, where models either stop
producing reasoning traces or produce seemingly ethical reasoning traces that
belie the unethical nature of their final outputs. We further examine the
efficacy of response strategies (polite refusal versus explicit rebuttal) in
curbing the occurrence of undesired behavior in subsequent outputs of
multi-turn interactions. Our findings reveal that explicit rebuttals
significantly outperform polite refusals in preventing the continuation of
undesired outputs and nearly eliminate reason-based deception, challenging
current practices in model fine-tuning. Accordingly, the two key contributions
of this paper are (1) defining and studying reason-based deception, a new type
of hidden behavior, and (2) demonstrating that rebuttals provide a more robust
response model to harmful requests than refusals, thereby highlighting the need
to reconsider the response strategies in fine-tuning approaches.

通过研究大型语言模型（LLMs）中微调的程度，该论文探讨了微调是否有效地缓解了不良行为，还是仅仅掩盖了它。研究采用逼真的角色扮演实验，通过观察模型在微调后的反应动态来评估结果。研究发现了一种普遍现象 —— 以推理为基础的欺骗，其中模型要么停止产生推理痕迹，要么产生看似道德的推理痕迹，却掩盖了最终输出的不道德性质。此外，论文还比较了响应策略（礼貌拒绝与明确反驳）在多轮交互输出中抑制不良行为发生的效果。研究结果表明，明确反驳在阻止不良输出的持续以及减少以推理为基础的欺骗方面明显优于礼貌拒绝，挑战了当前模型微调的做法。因此，本论文的两个关键贡献是（1）定义和研究了以推理为基础的欺骗，一种新类型的隐藏行为，（2）证明了明确反驳比拒绝提供了更强大的对抗有害请求的响应模型，从而强调了在微调方法中重新考虑响应策略的必要性。

重新思考在微调基础模型时的无害拒绝

Rethinking harmless refusals when fine-tuning foundation models

This paper introduces an adversarial method to stress-test trained metrics to
evaluate conversational dialogue systems. The method leverages Reinforcement
Learning to find response strategies that elicit optimal scores from the
trained metrics. We apply our method to test recently proposed trained metrics.
We find that they all are susceptible to giving high scores to responses
generated by relatively simple and obviously flawed strategies that our method
converges on. For instance, simply copying parts of the conversation context to
form a response yields competitive scores or even outperforms responses written
by humans.

本文介绍一种对话系统对话回复的鲁棒性测试方法，利用对抗学习方法提取出优化得分的回复策略，并利用该方法测试最近提出的训练度量标准，发现它们均容易将相对简单且明显存在缺陷的策略高分，如直接复制对话环境中的部分内容拼成回复竟然可以超越甚至优于人类翻译的水平。

探究面向对话系统的训练度量的鲁棒性

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems

Emergency response to incidents such as accidents, crimes, and fires is a
major problem faced by communities. Emergency response management comprises of
several stages and sub-problems like forecasting, resource allocation, and
dispatch. The design of principled approaches to tackle each problem is
necessary to create efficient emergency response management (ERM) pipelines.
Over the last six years, we have worked with several first responder
organizations to design ERM pipelines. In this paper, we highlight some of the
challenges that we have identified and lessons that we have learned through our
experience in this domain. Such challenges are particularly relevant for
practitioners and researchers, and are important considerations even in the
design of response strategies to mitigate disasters like floods and
earthquakes.

本文讨论针对灾害事故、犯罪和火灾的应急响应管理中的几个阶段和子问题，呈现了应对这些问题的一些原则性方法的设计，并着重介绍了应急响应管理的一些挑战和解决方案，以及应对其他类型灾害时的考虑。