In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
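The MSE-only variant described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes raster-ordered patches, a single causal self-attention layer with random weights, and no positional encoding. All names (`patchify`, `causal_attention`, `init_params`, `next_patch_mse`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a raster-ordered sequence of flat patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def causal_attention(x, wq, wk, wv):
    """Single-head self-attention; position i attends only to positions <= i."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    n = x.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))      # lower-triangular causal mask
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def init_params(patch_dim, d_model, rng):
    """Random weights for the toy model (assumed shapes, illustration only)."""
    scale = 0.1
    return {
        "embed": rng.normal(size=(patch_dim, d_model)) * scale,
        "wq": rng.normal(size=(d_model, d_model)) * scale,
        "wk": rng.normal(size=(d_model, d_model)) * scale,
        "wv": rng.normal(size=(d_model, d_model)) * scale,
        "head": rng.normal(size=(d_model, patch_dim)) * scale,
    }

def next_patch_mse(patches, params):
    """Predict patch t+1 from patches 0..t and score the prediction with MSE."""
    inputs, targets = patches[:-1], patches[1:]
    h = inputs @ params["embed"]
    h = h + causal_attention(h, params["wq"], params["wk"], params["wv"])
    preds = h @ params["head"]
    return np.mean((preds - targets) ** 2), preds

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))                   # stand-in for a real image
patches = patchify(image, patch=4)                   # 4 patches of 48 values each
loss, preds = next_patch_mse(patches, init_params(patches.shape[1], 16, rng))
```

Because of the causal mask, the prediction for a given patch depends only on the patches that precede it in raster order. In the diffusion variant, the linear `head` and MSE loss would be replaced by a denoising patch decoder, which this sketch does not attempt to model.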

Denoising Autoregressive Representation Learning