State-of-the-art language model fine-tuning techniques, such as Direct
Preference Optimization (DPO), restrict user control by hard-coding predefined
behaviors into the model. To address this, we propose a novel method,
Configurable Safety Tuning (CST), that augments DPO using synthetic preference
data to facilitate flexible safety configuration of LLMs at inference time. CST
overcomes the constraints of vanilla DPO by introducing a system prompt that
specifies the safety configuration, enabling LLM deployers to enable or disable
safety preferences as needed simply by changing the system prompt. Our
experimental evaluations indicate that CST successfully manages different
safety configurations and retains the original functionality of LLMs, showing
it is a robust method for configurable deployment. Data and models are available at
this https URL
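
To make the idea of system-prompt-conditioned preference data concrete, here is a minimal sketch of how such pairs might be laid out for DPO-style training. It rests on an assumption the abstract does not spell out: that the same prompt appears under two opposite system prompts, with the chosen and rejected responses swapped between them. The system prompt wording, the `PreferencePair` type, and the `build_configurable_pairs` helper are all hypothetical and only illustrate the data layout.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical system prompts: the abstract does not give their wording.
SAFE_SYSTEM = "You are a helpful assistant that refuses harmful requests."
UNCENSORED_SYSTEM = "You are a helpful assistant with safety preferences disabled."


@dataclass
class PreferencePair:
    """One DPO-style training example: a context plus a preferred and a dispreferred reply."""
    system: str
    prompt: str
    chosen: str
    rejected: str


def build_configurable_pairs(prompt: str, safe_response: str,
                             unsafe_response: str) -> List[PreferencePair]:
    """Emit the same prompt under both configurations.

    Assumption (not confirmed by the abstract): the preference direction is
    flipped when the system prompt changes, so the model learns to condition
    its behavior on the system prompt rather than on the prompt alone.
    """
    return [
        PreferencePair(SAFE_SYSTEM, prompt,
                       chosen=safe_response, rejected=unsafe_response),
        PreferencePair(UNCENSORED_SYSTEM, prompt,
                       chosen=unsafe_response, rejected=safe_response),
    ]


pairs = build_configurable_pairs(
    prompt="<user request>",
    safe_response="<refusal text>",
    unsafe_response="<compliant text>",
)
for pair in pairs:
    print(pair.system, "->", pair.chosen)
```

Under this layout a deployer would switch behavior at inference time only by choosing which system prompt to send; no retraining is involved.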

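This work proposes Configurable Safety Tuning (CST), a method that augments Direct Preference Optimization (DPO) with synthetic preference data so that a language model's safety behavior can be configured flexibly at inference time. It addresses the problem of restricted user control by introducing a system prompt that enables or disables safety preferences. Data and models are available at the link provided.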

Configurable Safety Tuning of Language Models with Synthetic Preference Data

Practical large language model (LLM) services may involve a long system
prompt, which specifies the instructions, examples, and knowledge documents of
the task and is reused across numerous requests. However, the long system
prompt causes throughput/latency bottlenecks as the cost of generating the next
token grows with the sequence length. This paper aims to improve the
efficiency of LLM services that involve long system prompts. Our key
observation is that handling these system prompts requires heavily redundant
memory accesses in existing causal attention computation algorithms.
Specifically, for batched requests, the cached hidden states (i.e., key-value
pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM
multiple times, each corresponding to an individual request. To eliminate this
redundancy, we propose RelayAttention, an attention algorithm that allows
reading these hidden states from DRAM exactly once for a batch of input tokens.
RelayAttention is a free lunch: it maintains the generation quality while
requiring no model retraining, as it is based on a mathematical reformulation
of causal attention.
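
The abstract does not state the reformulation itself. One standard identity that fits the description is that softmax attention over a concatenated key-value cache equals a convex combination of attention over each part, weighted by the parts' softmax normalizers. The sketch below checks that identity for a single decoding-step query in pure Python. It is an illustration under that assumption, not the paper's kernel, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_with_lse(q: Vector, keys: List[Vector],
                       values: List[Vector]) -> Tuple[Vector, float]:
    """Softmax attention of one query over keys/values.

    Also returns the log-sum-exp of the scaled scores, which is the log of
    the softmax normalizer and is needed to merge partial results.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, top + math.log(total)


def merged_attention(q: Vector,
                     sys_kv: Tuple[List[Vector], List[Vector]],
                     ctx_kv: Tuple[List[Vector], List[Vector]]) -> Vector:
    """Attend to the system-prompt cache and the request cache separately,
    then combine the two outputs using their normalizers."""
    sys_out, sys_lse = attention_with_lse(q, *sys_kv)
    ctx_out, ctx_lse = attention_with_lse(q, *ctx_kv)
    top = max(sys_lse, ctx_lse)
    sys_mass = math.exp(sys_lse - top)
    ctx_mass = math.exp(ctx_lse - top)
    alpha = sys_mass / (sys_mass + ctx_mass)  # share of weight on the system prompt
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(sys_out, ctx_out)]


# Toy caches: three system-prompt tokens shared by every request, two request tokens.
sys_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sys_vals = [[0.2, 0.1], [0.4, -0.3], [-0.5, 0.9]]
ctx_keys = [[0.5, -1.0], [-0.7, 0.3]]
ctx_vals = [[1.0, 2.0], [-1.5, 0.5]]
query = [0.5, -0.25]

full, _ = attention_with_lse(query, sys_keys + ctx_keys, sys_vals + ctx_vals)
merged = merged_attention(query, (sys_keys, sys_vals), (ctx_keys, ctx_vals))
print(max(abs(a - b) for a, b in zip(full, merged)))
```

Because the system-prompt term depends on the request only through the query, a batched implementation could in principle compute it for all queries in one pass over the shared cache. This sketch demonstrates only the mathematical equivalence, not the DRAM access pattern.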

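This paper proposes RelayAttention, an algorithm that improves the efficiency of large language model (LLM) services by addressing the throughput/latency bottleneck caused by long system prompts. For a batch of input tokens, it reads the cached hidden states of the system prompt from DRAM exactly once, which eliminates the redundant memory accesses.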