Advancements in large language models (LLMs) have demonstrated remarkable
capabilities across a diverse range of applications. These models excel in
generating text completions that are contextually coherent and cover an
extensive array of subjects. However, the vast datasets required for their
training make aligning response styles during the pretraining and instruction
tuning phases challenging. Consequently, an additional alignment phase is
typically employed, wherein the model is further trained with human preference
data to better align its outputs with human expectations. While this process
does not introduce new capabilities per se, it does accentuate generation styles
innate to the model. This paper explores the use of counterfactual
prompting within the framework of Direct Preference Optimization (DPO) to align
the model's style without relying on human intervention. We demonstrate that
this method effectively instils desirable behaviours, mitigates undesirable
ones, and encourages the model to disregard inappropriate instructions. Our
findings suggest that counterfactual prompting with DPO presents a low-resource
way to fine-tune LLMs to meet the demands for responsible and ethically aligned
AI systems.
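
To make the objective concrete, the following is a minimal sketch of the standard per-example DPO loss, $-\log \sigma(\beta[(\log \pi(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{ref}(y_l|x))])$, together with one plausible way to build a counterfactual preference pair. The abstract does not specify how pairs are constructed, so `build_counterfactual_pair`, its `style_instruction` argument, and the `generate` callable are hypothetical names, and treating the styled response as "chosen" is an assumption made for illustration only.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss computed from sequence log-probabilities."""
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_counterfactual_pair(prompt, style_instruction, generate):
    """Hypothetical pair builder (an assumption, not the paper's recipe).

    The response produced WITH the style instruction is stored as "chosen",
    the response produced WITHOUT it as "rejected", and both are attached to
    the plain prompt so that no human labelling is involved.
    """
    styled = generate(style_instruction + "\n\n" + prompt)
    plain = generate(prompt)
    return {"prompt": prompt, "chosen": styled, "rejected": plain}
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals $\log 2$; the loss falls as the policy shifts probability mass toward the chosen response relative to the reference.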

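This work explores counterfactual prompting within the Direct Preference Optimization framework as a way to align a model's style. The method effectively instils desirable behaviours, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions, offering a low-cost way for large language models to meet the demand for responsible and ethically aligned AI systems.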

Aligning Large Language Models with Counterfactual DPO

This paper studies the problem of training a two-layer ReLU network for
binary classification using gradient flow with small initialization. We
consider a training dataset with well-separated input vectors: Any pair of
input data with the same label are positively correlated, and any pair with
different labels are negatively correlated. Our analysis shows that, during the
early phase of training, neurons in the first layer try to align with either
the positive data or the negative data, depending on each neuron's corresponding weight
on the second layer. A careful analysis of the neurons' directional dynamics
allows us to provide an $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ upper bound on
the time it takes for all neurons to achieve good alignment with the input
data, where $n$ is the number of data points and $\mu$ measures how well the
data are separated. After the early alignment phase, the loss converges to zero
at an $\mathcal{O}(\frac{1}{t})$ rate, and the weight matrix of the first layer
is approximately low-rank. Numerical experiments on the MNIST dataset
illustrate our theoretical findings.
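
A small numerical sketch can make the setting concrete. The toy below is not the paper's experiment: the 2-D dataset, the width of 8, the scale `eps`, the balanced initialization $|v_j| = \|w_j\|$, the logistic loss, and the use of small-step gradient descent as a stand-in for gradient flow are all assumptions chosen for illustration. The `separation` helper uses the smallest cosine between label-signed inputs $y_i x_i$ as one plausible proxy for $\mu$; the abstract itself does not define $\mu$.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def forward(W, v, x):
    """Two-layer ReLU network: f(x) = sum_j v_j * relu(<w_j, x>)."""
    return sum(vj * max(0.0, dot(wj, x)) for wj, vj in zip(W, v))


def logistic_loss(W, v, X, y):
    return sum(math.log1p(math.exp(-yi * forward(W, v, xi)))
               for xi, yi in zip(X, y)) / len(X)


def separation(X, y):
    """Smallest cosine between label-signed inputs (assumed proxy for mu)."""
    best = 1.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            cos = y[i] * y[j] * dot(X[i], X[j])
            cos /= math.hypot(*X[i]) * math.hypot(*X[j])
            best = min(best, cos)
    return best


def train(W, v, X, y, lr=0.05, steps=4000):
    """Full-batch gradient descent; W and v are updated in place."""
    n = len(X)
    for _ in range(steps):
        gW = [[0.0, 0.0] for _ in W]
        gv = [0.0] * len(v)
        for xi, yi in zip(X, y):
            # Derivative of log(1 + exp(-y f)) with respect to f.
            g = -yi / (1.0 + math.exp(yi * forward(W, v, xi)))
            for j, (wj, vj) in enumerate(zip(W, v)):
                pre = dot(wj, xi)
                if pre > 0.0:  # only active neurons receive gradient
                    gW[j][0] += g * vj * xi[0] / n
                    gW[j][1] += g * vj * xi[1] / n
                    gv[j] += g * pre / n
        for j in range(len(W)):
            W[j][0] -= lr * gW[j][0]
            W[j][1] -= lr * gW[j][1]
            v[j] -= lr * gv[j]


# Same-label inputs are positively correlated, opposite-label ones negatively.
X = [(1.0, 0.2), (0.8, 0.5), (-1.0, -0.1), (-0.7, -0.6)]
y = [1, 1, -1, -1]

# Small, balanced initialization: neurons evenly spread on a circle of radius eps.
eps, m = 0.01, 8
W = [[eps * math.cos(0.3 + 2 * math.pi * k / m),
      eps * math.sin(0.3 + 2 * math.pi * k / m)] for k in range(m)]
v = [eps if k % 2 == 0 else -eps for k in range(m)]

initial_loss = logistic_loss(W, v, X, y)
train(W, v, X, y)
final_loss = logistic_loss(W, v, X, y)
```

Printing each neuron's direction `math.atan2(w[1], w[0])` alongside the sign of its `v` entry before and after training is a simple way to inspect which group of data a neuron has turned toward.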

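This work studies the training of two-layer ReLU networks for binary classification using gradient flow with small initialization. During the early alignment phase, first-layer neurons try to align with either the positive or the negative data, and an analysis of their directional dynamics yields an upper bound on the time the neurons need to achieve good alignment. After the alignment phase, the loss converges to zero at a 1/t rate, and the first-layer weight matrix is approximately low-rank. Experiments on the MNIST dataset validate the theoretical findings.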