Large language models (LLMs) are increasingly tuned to power complex generation tasks such as writing, fact-seeking, querying, and reasoning. Traditionally, human or model feedback for evaluating and further tuning LLM performance has been provided at the response level, which makes assessments faster and more cost-effective. However, recent works (Amplayo et al. [2022], Wu et al. [2023]) indicate that sentence-level labels may provide more accurate and interpretable feedback for LLM optimization. In this work, we introduce methods to disaggregate response-level labels into sentence-level (pseudo-)labels. Our approach combines multiple instance learning (MIL) and learning from label proportions (LLP) techniques with prior information (e.g., document-sentence cosine similarity) to train a specialized model for sentence-level scoring. We also use the model's predictions to pseudo-label the training set at the sentence level, and train on these pseudo-labels to further improve performance. We conduct extensive evaluations of our methods across six datasets and four tasks: retrieval, question answering, summarization, and math reasoning. Our results demonstrate improved performance compared to multiple baselines across most of these tasks. Our work is the first to develop techniques that convert response-level feedback into sentence-level scores while leveraging sentence-level prior information. It also provides comprehensive evaluations on multiple tasks, as well as an end-to-end finetuning evaluation showing performance comparable to a model trained on fine-grained human-annotated labels.

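We introduce a method for disaggregating response-level labels into sentence-level (pseudo-)labels. The method uses multiple instance learning (MIL) and learning from label proportions (LLP) techniques, together with prior information, to train a specialized model for sentence-level scoring, and it uses model predictions to pseudo-label the training set to further improve performance. We conduct extensive evaluations on six datasets and four tasks, and the results show improved performance over multiple baselines on most tasks. This work is the first to turn response-level feedback into sentence-level scoring techniques while using sentence-level prior information, with comprehensive evaluations and an end-to-end finetuning evaluation showing performance comparable to a model trained on fine-grained human-annotated labels.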
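
The general idea of training sentence-level scores from a single response-level label can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of mean aggregation for an LLP-style objective and max aggregation for a MIL-style objective, the squared-error prior term, and the weight `lam` are all assumptions made for illustration.

```python
import math

def aggregate(sentence_scores, mode="mean"):
    # "mean" mimics an LLP-style proportion; "max" mimics a MIL-style bag label.
    if mode == "max":
        return max(sentence_scores)
    return sum(sentence_scores) / len(sentence_scores)

def bag_loss(sentence_scores, response_label, mode="mean"):
    # Binary cross-entropy between the aggregated score and the response label.
    p = min(max(aggregate(sentence_scores, mode), 1e-7), 1 - 1e-7)
    return -(response_label * math.log(p)
             + (1 - response_label) * math.log(1 - p))

def cosine(u, v):
    # Cosine similarity between two dense vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def prior_loss(sentence_scores, doc_vec, sent_vecs):
    # Hypothetical prior term: pull each score toward the document-sentence
    # cosine similarity, rescaled from [-1, 1] to [0, 1].
    targets = [(cosine(doc_vec, s) + 1) / 2 for s in sent_vecs]
    sq = [(p - t) ** 2 for p, t in zip(sentence_scores, targets)]
    return sum(sq) / len(sq)

def total_loss(sentence_scores, response_label, doc_vec, sent_vecs,
               lam=0.5, mode="mean"):
    # Aggregate-label loss plus a weighted prior term (lam is an assumed weight).
    return (bag_loss(sentence_scores, response_label, mode)
            + lam * prior_loss(sentence_scores, doc_vec, sent_vecs))

scores = [0.9, 0.1]                   # model's per-sentence scores
doc = [1.0, 0.0]                      # toy document embedding
sents = [[1.0, 0.0], [0.0, 1.0]]      # toy sentence embeddings
loss = total_loss(scores, 1, doc, sents)
```

In a real setup the sentence scores would come from a trainable model and the loss would be minimized by gradient descent; the sketch only shows how a response-level label and a sentence-level prior could be combined into one objective.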

FRACTAL: Fine-Grained Scoring from Aggregate Text Labels