Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
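For orientation, the standard DPO objective that D3PO adapts can be sketched as below. This is the original sequence-level form of DPO, not the continuous-time Markov chain loss derived in this work. The symbols are assumptions introduced here for illustration: $c$ is a conditioning context, $x^w$ and $x^l$ are the preferred and dispreferred samples, $p_\theta$ is the model being fine-tuned, $p_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta > 0$ controls how strongly the model is held to the reference.

```latex
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(c,\,x^w,\,x^l)}
    \left[
      \log \sigma\!\left(
        \beta \left[
          \log \frac{p_\theta(x^w \mid c)}{p_{\mathrm{ref}}(x^w \mid c)}
          - \log \frac{p_\theta(x^l \mid c)}{p_{\mathrm{ref}}(x^l \mid c)}
        \right]
      \right)
    \right]
\]
```

When $p_\theta = p_{\mathrm{ref}}$ the bracketed term vanishes and the loss equals $\log 2$; it decreases as the model raises the likelihood of preferred samples relative to dispreferred ones, measured against the reference.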

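This work addresses the challenge of aligning discrete diffusion models with task-specific preferences when no explicit reward function is available. The proposed D3PO method uses a novel loss function to optimize the generative process directly from preference data while preserving fidelity to a reference distribution. The results show that D3PO effectively aligns model outputs with preferences without requiring an explicit reward model, offering a more practical alternative to reinforcement learning-based approaches.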

Preference-Based Alignment of Discrete Diffusion Models: D3PO