Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets and can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data by the extent to which the pretrained LLM has memorized the underlying knowledge, and we conduct a series of empirical analyses. Our experiments, involving four LLMs from three model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, the memory level of the SFT data has a significant impact on LLM performance, and the optimal dataset differs depending on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.

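This study addresses the underexplored question of how to fine-tune large language models for the question-answering task, and proposes categorizing supervised fine-tuning data by the extent to which the pretrained model has memorized the relevant knowledge. It finds that as few as 60 data points in the fine-tuning stage are enough to activate the knowledge encoded during pre-training, and that data of different memory levels has a significant impact on model performance, with the optimal dataset varying from model to model.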
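
The categorization step can be illustrated with a minimal sketch. The abstract does not say how memory level is measured, so everything below is an assumption made for illustration only: the score (fraction of sampled pretrained-model answers that contain the gold answer), the four level names and their thresholds, and the helper names `memory_score`, `memory_level`, `group_by_level`, and `sample_sft_subset` are all hypothetical. Only the subset size of 60 comes from the text.

```python
import random
from collections import defaultdict


def memory_score(sampled_answers, gold_answer):
    """Assumed proxy for memorization: the fraction of answers sampled from
    the pretrained model that contain the gold answer (case-insensitive)."""
    if not sampled_answers:
        return 0.0
    gold = gold_answer.strip().lower()
    hits = sum(gold in answer.strip().lower() for answer in sampled_answers)
    return hits / len(sampled_answers)


def memory_level(score):
    """Map a score in [0, 1] to a hypothetical memory level.
    The thresholds are illustrative, not taken from the paper."""
    if score == 0.0:
        return "none"
    if score < 0.5:
        return "low"
    if score < 1.0:
        return "medium"
    return "high"


def group_by_level(examples):
    """Group QA examples by memory level. Each example is assumed to be a
    dict with 'question', 'answer', and 'samples' (pretrained-model outputs)."""
    groups = defaultdict(list)
    for example in examples:
        score = memory_score(example["samples"], example["answer"])
        groups[memory_level(score)].append(example)
    return dict(groups)


def sample_sft_subset(groups, level, n=60, seed=0):
    """Draw up to n examples from a single memory level for SFT."""
    pool = groups.get(level, [])
    return random.Random(seed).sample(pool, min(n, len(pool)))
```

Under these assumptions, one would build an SFT set per memory level with `sample_sft_subset(groups, level)` and fine-tune each model on each set separately, which is the kind of comparison across datasets and models that the abstract describes.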

Empirical Insights on Fine-Tuning Large Language Models for Question Answering