Scale has become a main ingredient in obtaining strong machine learning
models. As a result, understanding a model's scaling properties is key to
effectively designing both the right training setup and future
generations of architectures. In this work, we argue that scale and training
research has been needlessly complex due to reliance on the cosine schedule,
which prevents training across different lengths for the same model size. We
investigate the training behavior of a direct alternative, a constant learning
rate with cooldowns, and find that it scales as predictably and reliably as
cosine. Additionally, we show that stochastic weight averaging yields
improved performance along the training trajectory, without additional training
costs, across different scales. Importantly, with these findings we demonstrate
that scaling experiments can be performed with significantly reduced compute
and GPU hours by utilizing fewer but reusable training runs. Our code is
available at this https URL
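
To make the contrast with cosine concrete, the following is a minimal sketch of a constant learning rate with a cooldown, not the authors' implementation. The function names, the linear cooldown shape, the warmup, and all default values (`peak_lr`, `warmup_steps`, `cooldown_frac`, `final_frac`) are illustrative assumptions; the abstract does not specify them.

```python
import math


def constant_with_cooldown(step, total_steps, peak_lr=1e-3,
                           warmup_steps=100, cooldown_frac=0.2):
    """Warmup -> constant -> linear cooldown to zero (assumed shape)."""
    cooldown_steps = max(1, int(total_steps * cooldown_frac))
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:
        # Linear warmup up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < cooldown_start:
        # Constant phase: the value does not depend on total_steps.
        return peak_lr
    # Linear cooldown from peak_lr down to zero at total_steps.
    return peak_lr * max(total_steps - step, 0) / cooldown_steps


def cosine_schedule(step, total_steps, peak_lr=1e-3,
                    warmup_steps=100, final_frac=0.1):
    """Cosine decay to final_frac * peak_lr, for comparison."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    final_lr = peak_lr * final_frac
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# The same step in runs of different planned lengths.
short_run = constant_with_cooldown(500, total_steps=1000)
long_run = constant_with_cooldown(500, total_steps=5000)
```

In this sketch the constant phase is identical for both run lengths, so a checkpoint from the longer run can be reused as the starting point for a cooldown at a shorter length. The cosine value at the same step depends on `total_steps`, which is why each training length needs its own run.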

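By studying model scale and training behavior, this work proposes a constant learning rate with cooldowns as a simpler, predictable, and reliable alternative to the cosine schedule. It also finds that stochastic weight averaging improves performance along the training trajectory at no additional training cost. Together these reduce compute and GPU hours and make scaling experiments more efficient.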

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Domain generalization (DG) aims to learn a model that generalizes to an unseen
target domain using only a limited set of source domains. Previous attempts at
DG fail to learn domain-invariant representations from the source domains alone
because of the significant domain shifts between training and test domains.
Instead, we re-formulate the DG objective using mutual information with the
oracle model, a model generalized to any possible domain. We derive a tractable
variational lower bound by approximating the oracle model with a pre-trained
model; we call the resulting method Mutual Information Regularization with
Oracle (MIRO). Our extensive experiments
show that MIRO significantly improves the out-of-distribution performance.
Furthermore, our scaling experiments show that the larger the scale of the
pre-trained model, the greater the performance improvement of MIRO. Source code
is available at this https URL
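
One common way to turn a variational lower bound on mutual information into a training term is to fit a Gaussian variational distribution, in which case maximizing the bound amounts to minimizing a Gaussian negative log-likelihood. The sketch below illustrates that general idea only. The Gaussian form, the diagonal covariance, the identity mean, and the names `gaussian_mi_regularizer`, `total_loss`, and `beta` are assumptions made for illustration and are not taken from the abstract.

```python
import numpy as np


def gaussian_mi_regularizer(z_f, z_pre, log_var):
    """Gaussian negative log-likelihood (up to constants) of the pre-trained
    features z_pre given the current features z_f.

    z_f, z_pre: arrays of shape (batch, dim)
    log_var:    array of shape (dim,), log of an assumed diagonal variance
    """
    mu = z_f  # assumption: identity mean encoder
    sq_err = (z_pre - mu) ** 2
    per_example = np.sum(log_var + sq_err / np.exp(log_var), axis=1)
    return float(np.mean(per_example))


def total_loss(task_loss, z_f, z_pre, log_var, beta=0.1):
    # Task loss plus the weighted regularizer; beta is a hypothetical weight.
    return task_loss + beta * gaussian_mi_regularizer(z_f, z_pre, log_var)


# Toy features: the current model is one unit away from the pre-trained one
# in every dimension.
z_pre = np.ones((4, 3))
z_f = np.zeros((4, 3))
reg = gaussian_mi_regularizer(z_f, z_pre, log_var=np.zeros(3))
```

The regularizer is zero when the current features match the pre-trained features under unit variance, and it grows as they drift apart. That penalty is what keeps the learned representation close to the pre-trained model standing in for the oracle.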

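Using mutual information regularization with an oracle model, this work derives a tractable variational lower bound by approximating the oracle with a pre-trained model. Its scaling experiments show that the larger the pre-trained model, the greater the performance improvement from MIRO.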

Domain Generalization by Mutual-Information Regularization with Pre-trained Models

We explore the trade-offs of performing linear algebra using Apache Spark,
compared to traditional C and MPI implementations on HPC platforms. Spark is
designed for data analytics on cluster computing platforms with access to local
disks and is optimized for data-parallel tasks. We examine three widely-used
and important matrix factorizations: NMF (for physical plausibility), PCA (for
its ubiquity) and CX (for data interpretability). We apply these methods to
TB-sized problems in particle physics, climate modeling and bioimaging. The
data matrices are tall and skinny, which enables the algorithms to map
conveniently onto Spark's data-parallel model. We perform scaling experiments
on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide
tuning guidance to obtain high performance.
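
The reason a tall-and-skinny matrix suits a data-parallel model can be sketched with PCA: each row partition contributes only small per-partition statistics, which are then summed. The NumPy code below is a single-process illustration of that map-and-reduce pattern under assumed names (`partial_stats`, `pca_tall_skinny`); it is not the Spark, C, or MPI code evaluated in the paper.

```python
import numpy as np


def partial_stats(block):
    # "Map" step: per-partition row count, column sums, and d x d Gram matrix.
    return block.shape[0], block.sum(axis=0), block.T @ block


def pca_tall_skinny(blocks, k):
    """Top-k principal components of a matrix given as a list of row blocks."""
    n, col_sum, gram = 0, 0.0, 0.0
    for block in blocks:
        # "Reduce" step: only small (d and d x d) quantities are combined.
        bn, bs, bg = partial_stats(block)
        n += bn
        col_sum = col_sum + bs
        gram = gram + bg
    mean = col_sum / n
    # Sample covariance from the accumulated sums.
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]


rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 8))   # tall and skinny: many rows, few columns
blocks = np.array_split(A, 16)         # stand-in for distributed row partitions
top_vals, top_vecs = pca_tall_skinny(blocks, k=3)
```

Because the number of columns is small, the final eigendecomposition is cheap and can run on a single node. Forming the Gram matrix does square the condition number, so this shortcut trades some numerical accuracy for simplicity.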

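This work explores the trade-offs between performing linear algebra with Apache Spark and using traditional C and MPI implementations on HPC platforms. It examines three widely used matrix factorizations (NMF, PCA, and CX) and applies them to TB-sized problems in particle physics, climate modeling, and bioimaging. Scaling experiments are run on up to 1600 Cray XC40 nodes; the sources of slowdowns are described, and tuning guidance is provided for obtaining high performance.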