When using Stochastic Gradient Descent (SGD) for training machine learning models, it is often crucial to provide the model with examples sampled at random from the dataset. However, for large datasets stored in the cloud, random access to individual examples is often costly and inefficient. A recent work \cite{corgi} proposed an online shuffling algorithm called CorgiPile, which greatly improves the efficiency of data access at the cost of some performance loss; this loss is particularly apparent for large datasets stored in homogeneous shards (e.g., video datasets). In this paper, we introduce a novel two-step partial data shuffling strategy for SGD which combines an offline iteration of the CorgiPile method with a subsequent online iteration. Our approach enjoys the best of both worlds: it performs similarly to SGD with random access (even for homogeneous data) without compromising the data access efficiency of CorgiPile. We provide a comprehensive theoretical analysis of the convergence properties of our method and demonstrate its practical advantages through experimental results.
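To make the two-step idea concrete, the following is a minimal sketch, not the method as specified in the paper. It assumes that a CorgiPile-style pass reads a few randomly chosen shards sequentially into a buffer and shuffles the buffer, that the offline step writes the output of one such pass back as new shards, and that the online step runs a second pass over those new shards during training. The names `corgipile_pass`, `offline_reshard`, `buffer_shards`, and `shard_size` are hypothetical and chosen for illustration only.

```python
import random


def corgipile_pass(shards, buffer_shards, rng):
    """Yield examples in a partially shuffled order (assumed CorgiPile-style pass).

    Shards are visited in random order, `buffer_shards` at a time. Each group
    is read whole (sequential access only) and shuffled in memory.
    """
    order = list(range(len(shards)))
    rng.shuffle(order)  # random shard order, no per-example random access
    for start in range(0, len(order), buffer_shards):
        buffer = []
        for idx in order[start:start + buffer_shards]:
            buffer.extend(shards[idx])  # sequential read of one full shard
        rng.shuffle(buffer)  # shuffle only within the in-memory buffer
        yield from buffer


def offline_reshard(shards, buffer_shards, shard_size, rng):
    """Offline step (assumption): run one pass and write it back as new shards."""
    stream = list(corgipile_pass(shards, buffer_shards, rng))
    return [stream[i:i + shard_size] for i in range(0, len(stream), shard_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Homogeneous toy shards: every example in shard s carries label s.
    shards = [[(s, i) for i in range(4)] for s in range(8)]
    mixed = offline_reshard(shards, buffer_shards=4, shard_size=4, rng=rng)
    # Online step: a second pass over the already mixed shards feeds SGD.
    for example in corgipile_pass(mixed, buffer_shards=2, rng=rng):
        pass  # a training step on `example` would go here
```

In this toy setup each original shard holds a single label, which mimics the homogeneous case mentioned above. After the offline step, each new shard can hold examples drawn from several original shards, so the online pass starts from less homogeneous data while still reading whole shards sequentially.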


Corgi^2: A Hybrid Offline-Online Approach to Storage-Aware Data Shuffling for SGD