Scale has become a main ingredient in obtaining strong machine learning
models. As a result, understanding a model's scaling properties is key to
effectively designing both the right training setup and future
generations of architectures. In this work, we argue that scale and training
research has been needlessly complex due to reliance on the cosine schedule,
which prevents training across different lengths for the same model size. We
investigate the training behavior of a direct alternative, a constant learning
rate with cooldowns, and find that it scales as predictably and reliably as
cosine. Additionally, we show that stochastic weight averaging yields
improved performance along the training trajectory, without additional training
costs, across different scales. Importantly, with these findings we demonstrate
that scaling experiments can be performed with significantly reduced compute
and GPU hours by utilizing fewer but reusable training runs. Our code is
available at this https URL
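
As a rough illustration of the schedule the abstract describes, the sketch below holds the learning rate constant and then decays it over a final cooldown window. The abstract does not specify the cooldown shape or length, so the linear decay to zero, the function name `constant_with_cooldown`, and the parameters `peak_lr`, `warmup_steps`, and `cooldown_steps` are all assumptions made for this example.

```python
def constant_with_cooldown(step, total_steps, peak_lr,
                           warmup_steps=0, cooldown_steps=0):
    """Learning rate at `step` for a constant schedule with a final cooldown.

    Assumptions (not taken from the abstract): linear warmup, linear
    cooldown to zero, and 0-indexed steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    cooldown_start = total_steps - cooldown_steps
    # Optional linear warmup up to the peak learning rate.
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Constant phase: the rate does not depend on total_steps here.
    if cooldown_steps == 0 or step < cooldown_start:
        return peak_lr
    # Cooldown phase: linear decay that reaches zero at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / cooldown_steps


# Two runs of different length share the same constant phase.
short_run = [constant_with_cooldown(s, 100, 1.0, cooldown_steps=20) for s in range(100)]
long_run = [constant_with_cooldown(s, 200, 1.0, cooldown_steps=20) for s in range(200)]
```

Because the rate before the cooldown does not depend on `total_steps`, the first 80 steps of `short_run` and `long_run` are identical. Under these assumptions, that is what lets a single long run be branched into cooldowns of several lengths, whereas a cosine schedule would need a separate run for each length.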

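By studying how models scale and behave during training, this work proposes a constant learning rate with cooldowns as a simpler, predictable, and reliable alternative to the cosine schedule. It also finds that stochastic weight averaging improves performance along the training trajectory at no additional training cost, which reduces compute and GPU hours and makes scaling experiments more efficient.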

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Denoising Diffusion Probabilistic models have become increasingly popular due
to their ability to offer probabilistic modeling and generate diverse outputs.
This versatility inspired their adaptation for image segmentation, where
multiple predictions of the model can produce segmentation results that not
only achieve high quality but also capture the uncertainty inherent in the
model. Powerful architectures have been proposed to improve diffusion
segmentation performance. However, there is a notable lack of analysis and
discussions on the differences between diffusion segmentation and image
generation, and thorough evaluations are missing that distinguish the
improvements these architectures provide for segmentation in general from their
benefit for diffusion segmentation specifically. In this work, we critically
analyse and discuss how diffusion segmentation for medical images differs from
diffusion image generation, with a particular focus on the training behavior.
Furthermore, we assess how proposed diffusion segmentation
architectures perform when trained directly for segmentation. Lastly, we
explore how different medical segmentation tasks influence the diffusion
segmentation behavior and how the diffusion process could be adapted accordingly.
With these analyses, we aim to provide in-depth insights into the behavior of
diffusion segmentation that allow for a better design and evaluation of
diffusion segmentation methods in the future.
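
The idea that multiple predictions can yield both a segmentation and an uncertainty estimate can be sketched as follows. The abstract does not say how the sampled predictions are combined, so the per-pixel mean, the 0.5 threshold, and the Bernoulli variance used here are assumptions for illustration. The function name `aggregate_masks` is hypothetical, and the input masks stand in for samples drawn from a diffusion segmentation model.

```python
def aggregate_masks(masks):
    """Combine several binary masks of equal shape into summary maps.

    `masks` is a list of 2D lists containing 0 or 1. Returns the per-pixel
    foreground frequency, a thresholded consensus mask, and a per-pixel
    uncertainty map (Bernoulli variance p * (1 - p)).
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    # Fraction of samples that label each pixel as foreground.
    mean = [[sum(m[i][j] for m in masks) / n for j in range(width)]
            for i in range(height)]
    # Consensus: foreground where at least half of the samples agree.
    consensus = [[1 if p >= 0.5 else 0 for p in row] for row in mean]
    # Uncertainty is zero where all samples agree and peaks at p = 0.5.
    uncertainty = [[p * (1 - p) for p in row] for row in mean]
    return mean, consensus, uncertainty


# Four hypothetical samples of a 1x2 image: they agree on the first
# pixel and split evenly on the second.
samples = [[[1, 0]], [[1, 1]], [[1, 0]], [[1, 1]]]
mean, consensus, uncertainty = aggregate_masks(samples)
```

In this toy input the first pixel has zero uncertainty, while the second pixel, where the samples disagree, has the maximum value of 0.25.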

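This work analyzes and discusses the differences between diffusion segmentation and image generation, with a focus on training behavior. It evaluates how diffusion segmentation architectures perform when trained directly for segmentation, and examines how different medical segmentation tasks influence diffusion segmentation behavior and how the diffusion process can be adapted accordingly. These analyses aim to provide in-depth insights for the design and evaluation of future diffusion segmentation methods.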