Large Language Models (LLMs) are so powerful that they sometimes learn
correlations between labels and features that are irrelevant to the task,
leading to poor generalization on out-of-distribution data. We propose
explanation-based finetuning as a novel and general approach to mitigate LLMs'
reliance on spurious correlations. Unlike standard finetuning where the model
only predicts the answer given the input, we finetune the model to additionally
generate a free-text explanation supporting its answer. To evaluate our method,
we finetune the model on artificially constructed training sets containing
different types of spurious cues, and test it on a test set without these cues.
Compared to standard finetuning, our method makes models remarkably more robust
against spurious cues in terms of accuracy drop across four classification
tasks: ComVE (+1.2), CREAK (+9.1), e-SNLI (+15.4), and SBIC (+6.5). Moreover,
our method works equally well with explanations generated by the model,
implying its applicability to more datasets without human-written explanations.

本文提出了基于解释的微调作为一种缓解大型语言模型依赖错误相关的新颖通用方法，并在人工构建的训练集上微调模型，使其更加强壮。与标准微调不同，我们不仅仅针对输入进行预测，还微调模型以生成支持其答案的自由文本解释。与标准微调相比，我们的方法在四个分类任务中使模型对伪线索具有明显更强的稳健性。此外，我们的方法同样适用于由模型生成的解释，暗示了其在更多数据集上的适用性。