BriefGPT.xyz
Jun, 2023
Generating Synthetic Datasets by Interpolating along Generalized Geodesics
Jiaojiao Fan, David Alvarez-Melis
TL;DR
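This paper proposes a new method, based on concepts from optimal transport (OT) theory, for interpolating among multiple datasets to synthesize datasets with a prescribed similarity to a target dataset, offering a practical approach to pretraining for a target domain.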
Abstract
Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target do