Scaling laws for large language models (LLMs) have provided useful guidance on how to train ever larger models for predictable performance gains. Time series forecasting shares a sequential structure similar to that of language, and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to that of LLMs, while architectural details (aspect ratio and number of heads) have a minimal effect over broad ranges. We assemble a large corpus of heterogeneous time series data on which to train, and establish, for the first time, power-law scaling relations with respect to parameter count, dataset size, and training compute, spanning five orders of magnitude.
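As an illustration of what a power-law scaling relation means in practice, the sketch below fits loss ≈ a · N^(−α) by linear regression in log-log space. It is a minimal example under stated assumptions, not the authors' fitting procedure: the function name `fit_power_law`, the single-term functional form, and the synthetic data points are all hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-alpha) by least squares in log-log space.

    Assumes a pure power law with no irreducible-loss offset
    (an illustrative simplification, not the paper's stated form).
    """
    # log(loss) = log(a) - alpha * log(x), so a straight-line fit suffices
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return np.exp(intercept), -slope

# Synthetic, noise-free data (hypothetical values, not results from the paper):
# parameter counts spanning five orders of magnitude.
params = np.logspace(3, 8, num=6)
true_a, true_alpha = 5.0, 0.1
loss = true_a * params ** (-true_alpha)

a, alpha = fit_power_law(params, loss)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

The same fit can in principle be repeated with dataset size or training compute on the x-axis; with real measurements, noise and any saturation at large scale would need separate treatment.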


Scaling Laws for Large Time Series Models