The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples outperforms its counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.

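This work addresses the shortage of high-quality Supervised Fine-Tuning (SFT) data for Large Language Models (LLMs) by proposing Condor, a two-stage synthetic data generation framework. Condor combines World Knowledge Tree and Self-Reflection Refinement to generate high-quality SFT data at scale. Experiments show that a base model fine-tuned on only 20K Condor-generated samples outperforms its counterparts, and the refinement stage further enables iterative self-improvement for LLMs at various scales, validating the effectiveness of the approach.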

Condor: Enhancing the Alignment of Large Language Models through Knowledge-Driven Data Synthesis and Refinement