How can pronominal references in group chats be resolved? In this work, we preprocessed 58k authentic chat records and manually annotated 2.3k questions. We confirmed the reliability of this annotation using the scaling law. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters. The best version improved the F1 score by 29.07. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) creating Supervised Fine-Tuning (SFT) training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) developing a method for acquiring high-quality data by leveraging the scaling law principle. The script, the raw data in alpaca format, and the experiment tracking are open-sourced on GitHub https://github.com/InternLM/HuixiangDou/tree/main/web/tools, HuggingFace https://huggingface.co/tpoisonooo and WandB https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo . The use of the data involved has been authorized by the users.
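To make the "alpaca format" concrete, the following is a minimal sketch of how one annotated chat question might be packed into an SFT record. Only the three field names (`instruction`, `input`, `output`) come from the alpaca convention; the helper `to_alpaca_record`, the instruction wording, and the sample chat are hypothetical illustrations, not the released data or the authors' script.

```python
import json


def to_alpaca_record(history, question, rewritten):
    """Build one alpaca-style SFT record (hypothetical helper).

    history   -- list of (speaker, text) tuples preceding the question
    question  -- the raw question, possibly containing pronouns
    rewritten -- the annotated question with references resolved
    """
    context = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return {
        # Assumed instruction wording; the released data may phrase it differently.
        "instruction": "Rewrite the question so that every pronoun is "
                       "replaced by the entity it refers to in the chat history.",
        "input": f"Chat history:\n{context}\n\nQuestion: {question}",
        "output": rewritten,
    }


# Made-up sample chat, for illustration only.
history = [
    ("user_a", "Has anyone tried LoRA fine-tuning on the 7B model?"),
    ("user_b", "Yes, it worked fine on a single GPU."),
]
record = to_alpaca_record(
    history,
    "How long did it take?",
    "How long did LoRA fine-tuning on the 7B model take?",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

A list of such records, serialized as JSON, is the shape that alpaca-style fine-tuning pipelines typically consume; the actual prompts and fields used in this work should be taken from the open-sourced data.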

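How can pronominal references in group chats be eliminated? This paper preprocesses 58k authentic chat records and manually annotates 2.3k questions, and verifies the reliability of the annotation. We then fine-tune Qwen models ranging from 0.5B to 32B parameters; the best version improves the F1 score by 29.07, confirming the feasibility of using Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) creating Supervised Fine-Tuning (SFT) training data in alpaca format, together with a set of Low-Rank Adaptation (LoRA) weights; 2) developing a method for acquiring high-quality data based on the scaling law principle. The script, the raw data in alpaca format, and the experiment tracking are open-sourced on GitHub, HuggingFace, and WandB. The data privacy has been authorized by users.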

HuixiangDou-CR: Coreference Resolution in Group Chats