Recent advances in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in using LLMs for high-quality data synthesis. However, generating synthetic data by prompting LLMs remains challenging because LLMs have a limited understanding of target data distributions and prompt engineering is complex, especially for structured, formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information about the original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM's generative objectives via a plug-and-play latent feature injection module. Because we observed significant discrepancies between the VAE's latent representations and the real data distribution, we introduce a latent diffusion module into the framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., tabular, code, and tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2-7 percent in certain cases. The data and code will be publicly available upon completion of internal review.

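This work addresses two problems in using large language models to generate high-quality synthetic data: insufficient understanding of the target data distribution and complex prompt engineering. The proposed DiffLM framework combines a variational autoencoder with diffusion models and, by decoupling the learning of target distribution knowledge from the generative objective, achieves higher information retention and better control over format structure. Evaluations on seven real-world datasets show that DiffLM's downstream task performance surpasses that of real data by 2-7% in certain cases.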

DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models