Scaling Laws for Learning with Real and Surrogate Data

Feb, 2024

Scaling laws for learning with real and surrogate data

Ayush Jain, Andrea Montanari, Eren Sasoglu

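TL;DR: Integrating surrogate data can significantly reduce the test error of a trained model, provided it is combined with the real data through suitably weighted empirical risk minimization. The test error of models trained on a mixture of real and surrogate data is well described by a scaling law, which can be used to predict both the optimal weighting and the gain from adding surrogate data.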
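As an illustration of the weighted empirical risk minimization mentioned above, here is a minimal sketch for ridge regression. It is not the authors' code: the objective form `(1 - alpha) * real_loss + alpha * surrogate_loss`, the function names `weighted_erm_ridge` and `mse`, the synthetic data, and the grid search over `alpha` on a held-out real validation set are all assumptions made for the example.

```python
import numpy as np


def weighted_erm_ridge(X_real, y_real, X_sur, y_sur, alpha, lam=1e-3):
    """Minimize (1 - alpha) * MSE_real + alpha * MSE_surrogate + lam * ||theta||^2.

    alpha = 0 ignores the surrogate data; alpha = 1 ignores the real data.
    """
    n, d = X_real.shape
    m = X_sur.shape[0]
    # Setting the gradient of the weighted objective to zero gives A @ theta = b.
    A = ((1 - alpha) / n) * X_real.T @ X_real \
        + (alpha / m) * X_sur.T @ X_sur \
        + lam * np.eye(d)
    b = ((1 - alpha) / n) * X_real.T @ y_real + (alpha / m) * X_sur.T @ y_sur
    return np.linalg.solve(A, b)


def mse(theta, X, y):
    """Mean squared error of the linear predictor X @ theta."""
    return float(np.mean((X @ theta - y) ** 2))


# Hypothetical synthetic setup: few real samples, many slightly shifted surrogates.
rng = np.random.default_rng(0)
d, n, m, n_val = 10, 30, 500, 200
theta_real = rng.normal(size=d)
theta_sur = theta_real + 0.3 * rng.normal(size=d)  # surrogate distribution shift

X_real = rng.normal(size=(n, d))
y_real = X_real @ theta_real + 0.5 * rng.normal(size=n)
X_sur = rng.normal(size=(m, d))
y_sur = X_sur @ theta_sur + 0.5 * rng.normal(size=m)
X_val = rng.normal(size=(n_val, d))
y_val = X_val @ theta_real + 0.5 * rng.normal(size=n_val)

# Pick the weight by validation error on held-out real data.
alphas = np.linspace(0.0, 1.0, 11)
val_errors = [
    mse(weighted_erm_ridge(X_real, y_real, X_sur, y_sur, a), X_val, y_val)
    for a in alphas
]
best_alpha = float(alphas[int(np.argmin(val_errors))])
print(f"best alpha on validation set: {best_alpha:.1f}")
```

The endpoints `alpha = 0` and `alpha = 1` recover training on real data only and on surrogate data only, so sweeping `alpha` between them shows how much an intermediate weighting helps in this toy setting.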

Abstract

Collecting large quantities of high-quality data is often prohibitively expensive or impractical, and a crucial bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources like public datasets, data collected under different circumstances, or synthesized by generat