Instruction fine-tuning pretrained LLMs for diverse downstream tasks has
demonstrated remarkable success and has captured the interest of both academics
and practitioners. To ensure such fine-tuned LLMs align with human preferences,
techniques such as RLHF and DPO have emerged. At the same time, there is
increasing interest in models with smaller parameter counts. In this work, using
OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the
OpenBezoar family of models. In this recipe, we first generate synthetic
instruction fine-tuning data using an open and commercially non-restrictive
instruction fine-tuned variant of the Falcon-40B model under three schemes
based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a
seed dataset) and Orca (with the Flan Collection as a seed dataset), then
filter these generations using GPT-4 as a human proxy. We then perform
cost-effective QLoRA-based supervised fine-tuning sequentially with each
scheme. The resulting checkpoint is further fine-tuned with a subset of the
HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to
obtain the final checkpoint. Evaluation is done with the LM Eval Harness
tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with
Claude 2.1, with the finding that the final checkpoint,
"OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at
the 3B parameter scale, even outperforming the top model in one of the
categories on the Huggingface Open LLM Leaderboard. We release the
"OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", and "OpenBezoar-HH-RLHF-DPO"
checkpoints, alongside our generated datasets on HuggingFace at
this https URL
and our codebase at
this https URL.
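
The DPO objective mentioned in the final stage of the recipe can be sketched for a single preference pair as follows. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions, since the abstract does not state any hyperparameters.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta is an assumed value; the abstract does not give one.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)
```

The loss equals log 2 when the policy and the reference agree, and it shrinks as the policy assigns relatively more probability to the chosen response than the reference does.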

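Using OpenLLaMA 3Bv2 as the base model, we describe the recipe used to fine-tune the OpenBezoar family of models and show that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", outperforms many other models at the 3B parameter scale.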

OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

By allowing models to predict without task-specific training, in-context
learning (ICL) with pretrained LLMs has enormous potential in NLP. However, a
number of problems persist in ICL. In particular, its performance is sensitive
to the choice and order of in-context examples. Given the same set of
in-context examples with different orderings, model performance may vary
from near random to near state-of-the-art. In this work, we formulate
in-context example ordering as an optimization problem. We examine three
problem settings that differ in the assumptions they make about what is known
about the task. Inspired by the idea of learning from label proportions, we
propose two principles for in-context example ordering guided by the model's
probability predictions. We apply our proposed principles to thirteen text
classification datasets and nine different autoregressive LLMs with 700M to 13B
parameters. We demonstrate that our approach outperforms the baselines by
improving the classification accuracy, reducing model miscalibration, and also
by selecting better in-context examples.
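
The abstract does not spell out the two ordering principles, so the sketch below shows only one hypothetical way to treat ordering as an optimization problem in the spirit of learning from label proportions: among all permutations of the in-context examples, pick the one whose average predicted label distribution over unlabeled inputs is closest (in KL divergence) to an assumed known label prior. Every name here is an assumption, and `toy_predict_proba` is a made-up stand-in for an LLM with recency bias, not a real model call.

```python
import itertools
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(dists):
    """Average a list of probability vectors element-wise."""
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]


def select_ordering(examples, unlabeled_inputs, predict_proba, label_prior):
    """Exhaustively search orderings; feasible only for a few examples."""
    best_order, best_score = None, float("inf")
    for order in itertools.permutations(examples):
        dists = [predict_proba(order, x) for x in unlabeled_inputs]
        score = kl_divergence(label_prior, mean_distribution(dists))
        if score < best_score:
            best_order, best_score = order, score
    return list(best_order), best_score


def toy_predict_proba(order, x):
    """Hypothetical binary 'model': later demonstrations pull harder."""
    position_weights = [0.1, 0.2, 0.6]
    bias = sum(w * (1 if label == 1 else -1)
               for w, (_, label) in zip(position_weights, order))
    p1 = 1.0 / (1.0 + math.exp(-(bias + x)))
    return [1.0 - p1, p1]


examples = [("great film", 1), ("loved it", 1), ("dull plot", 0)]
inputs = [-1.0, 0.0, 1.0]  # toy scalar "inputs" for the stand-in model
order, score = select_ordering(examples, inputs, toy_predict_proba, [0.5, 0.5])
```

With this toy model, the search places the single negative example last, which offsets the two positive demonstrations and brings the average prediction closest to the uniform prior. Exhaustive search grows factorially with the number of examples, so a practical version would need sampling or a greedy search.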

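We formulate the ordering of in-context examples for pretrained language models as an optimization problem, improving text classification accuracy and selecting better in-context examples.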

In-Context Example Ordering Guided by Label Distributions

Among the many tasks that Large Language Models (LLMs) have revolutionized is
text classification. However, existing approaches for applying pretrained LLMs
to text classification predominantly rely on using single token outputs from
only the last layer of hidden states. As a result, they suffer from limitations
in efficiency, task-specificity, and interpretability. In our work, we
contribute an approach that uses all internal representations by employing
multiple pooling strategies on all activations and hidden states. Our novel
lightweight strategy, Sparsify-then-Classify (STC), first sparsifies
task-specific features layer-by-layer, then aggregates across layers for text
classification. STC can be applied as a seamless plug-and-play module on top of
existing LLMs. Our experiments on a comprehensive set of models and datasets
demonstrate that STC not only consistently improves the classification
performance of pretrained and fine-tuned models, but is also more efficient for
both training and inference, and is more intrinsically interpretable.
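
The pooling, layer-wise sparsification, and cross-layer aggregation described above can be sketched as follows. This is a rough illustration under stated assumptions, not the STC implementation: the abstract does not say how features are sparsified, so the sketch swaps in a simple stand-in (keep the `k` features whose class means differ most, for binary labels), and all function names are hypothetical. A final classifier would then be trained on the aggregated features.

```python
def pool_tokens(token_states):
    """Mean- and max-pool a [tokens][hidden] list into one feature vector."""
    dims = range(len(token_states[0]))
    mean = [sum(t[d] for t in token_states) / len(token_states) for d in dims]
    peak = [max(t[d] for t in token_states) for d in dims]
    return mean + peak


def select_features(features, labels, k):
    """Stand-in sparsifier: indices of the k features with the largest
    gap between class means (binary labels 0/1 assumed)."""
    pos = [f for f, y in zip(features, labels) if y == 1]
    neg = [f for f, y in zip(features, labels) if y == 0]

    def gap(d):
        pos_mean = sum(f[d] for f in pos) / len(pos)
        neg_mean = sum(f[d] for f in neg) / len(neg)
        return abs(pos_mean - neg_mean)

    ranked = sorted(range(len(features[0])), key=gap, reverse=True)
    return sorted(ranked[:k])


def sparsify_then_aggregate(per_layer_features, labels, k):
    """per_layer_features[l][n] is the pooled vector of example n at layer l.

    Selects features independently per layer, then concatenates the
    surviving features across layers for each example.
    """
    kept = [select_features(layer, labels, k) for layer in per_layer_features]
    aggregated = [
        [layer[n][d] for layer, idx in zip(per_layer_features, kept) for d in idx]
        for n in range(len(labels))
    ]
    return kept, aggregated
```

Because selection happens per layer before aggregation, the classifier input stays small even when every layer contributes features.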

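Our work proposes an approach that uses all internal representations by applying multiple pooling strategies to all activations and hidden states: it first sparsifies task-specific features layer by layer, then aggregates across layers for text classification. Our experiments show that STC not only consistently improves the classification performance of pretrained and fine-tuned models, but is also more efficient in training and inference and more intrinsically interpretable.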

Sparsify-then-Classify: From Internal Neurons of Large Language Models To Efficient Text Classifiers

Large Language Models (LLMs) with a billion or more parameters are prime
targets for network pruning, which aims to remove a portion of the network
weights without compromising performance. Prior approaches such as magnitude
pruning, SparseGPT, and Wanda either concentrated solely on weights or
integrated weights with activations for sparsity. However, they overlooked the
informative gradients derived from pretrained large language models. In this
paper, we present a novel sparsity-centric pruning method for pretrained LLMs,
termed Gradient-based Language Model Pruner (GBLM-Pruner). GBLM-Pruner
leverages the first-order term of the Taylor expansion, operating in a
training-free manner by harnessing properly normalized gradients from a few
calibration samples to determine the importance pruning score, and
substantially outperforms competitive counterparts like SparseGPT and Wanda in
multiple benchmarks. Intriguingly, after incorporating gradients, the
unstructured pruning method tends to reveal some structural patterns
post-pruning, which mirror the geometric interdependence inherent in the LLMs'
parameter structure. Additionally, GBLM-Pruner functions without any subsequent
retraining or weight updates, which keeps it as simple as its counterparts.
Extensive evaluations on LLaMA-1 and LLaMA-2 across various language benchmarks
and perplexity show that GBLM-Pruner surpasses magnitude pruning, Wanda
(weights+activations) and SparseGPT (weights+activations+weight update) by
significant margins. Our code and models are available at
this https URL
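
A gradient-informed pruning score of the kind described above can be sketched as follows. This is a generic first-order, training-free illustration, not the exact GBLM-Pruner metric: the abstract does not give the normalization, so the sketch assumes the score |w| times the L2 norm of that weight's gradient across calibration samples, and it assumes pruning is applied per output row. All names are hypothetical.

```python
import math


def importance_scores(weights, grads_per_sample):
    """Score each weight as |w| * ||g||_2, with the norm taken over
    calibration samples. grads_per_sample[s][i][j] is the gradient of
    the loss on sample s with respect to weights[i][j]."""
    rows, cols = len(weights), len(weights[0])
    scores = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            g_norm = math.sqrt(sum(g[i][j] ** 2 for g in grads_per_sample))
            scores[i][j] = abs(weights[i][j]) * g_norm
    return scores


def prune_rows(weights, scores, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights in each row.
    No retraining or weight update follows."""
    pruned = []
    for w_row, s_row in zip(weights, scores):
        n_prune = int(len(w_row) * sparsity)
        lowest = sorted(range(len(w_row)), key=lambda j: s_row[j])[:n_prune]
        drop = set(lowest)
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned
```

For the row `[1.0, -2.0, 0.5, 3.0]` with per-sample gradients `[0.1, 0.0, 2.0, 0.1]`, this score keeps the small weight 0.5 because its gradient is large, and drops -2.0 because its gradient is zero. Magnitude pruning at the same sparsity would instead drop 0.5 and 1.0.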

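The Gradient-based Language Model Pruner (GBLM-Pruner) for pretrained large language models uses normalized gradients from a few calibration samples to score weight importance, and it surpasses magnitude pruning, Wanda, and SparseGPT across various language benchmarks.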