Data augmentation, a technique in which a training set is expanded with class-preserving transformations, is ubiquitous in modern machine learning pipelines. In this paper, we seek to establish a theoretical framework for understanding modern data augmentation techniques. We start by showing that for kernel classifiers, data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. We connect this general approximation framework to prior work in invariant kernels, tangent propagation, and robust optimization. Next, we explicitly tackle the compositional aspect of modern data augmentation techniques, proposing a novel model of data augmentation as a Markov process. Under this model, we show that performing $k$-nearest neighbors with data augmentation is asymptotically equivalent to a kernel classifier. Finally, we illustrate ways in which our theoretical framework can be leveraged to accelerate machine learning workflows in practice, including reducing the amount of computation needed to train on augmented data, and predicting the utility of a transformation prior to training.
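
As a minimal numerical sketch of the first-order and second-order decomposition described above, the snippet below checks it for squared loss with a linear model on fixed features, a case in which the second-order expansion holds exactly. The additive-noise augmentation, the identity feature map, and the function names `augmented_loss` and `averaged_plus_variance` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def augmented_loss(w, feats, y):
    """Squared loss averaged over all augmented copies of one example.

    feats has shape (m, d): one row per augmented feature vector.
    """
    return 0.5 * np.mean((feats @ w - y) ** 2)


def averaged_plus_variance(w, feats, y):
    """Split the augmented loss into its two components.

    Returns (loss on the averaged feature, variance penalty).
    """
    psi = feats.mean(axis=0)                     # first-order: feature averaging
    centered = feats - psi
    sigma = centered.T @ centered / len(feats)   # population covariance of features
    first = 0.5 * (psi @ w - y) ** 2             # loss at the averaged feature
    second = 0.5 * w @ sigma @ w                 # second-order: variance regularizer
    return first, second


rng = np.random.default_rng(0)
x = rng.normal(size=5)                           # one training point
# Assumed augmentation: additive Gaussian noise, with an identity feature map.
feats = x + 0.1 * rng.normal(size=(200, 5))
w = rng.normal(size=5)
y = 1.0

exact = augmented_loss(w, feats, y)
first, second = averaged_plus_variance(w, feats, y)
print(exact, first + second)
```

For squared loss the two printed values agree up to floating-point error, because the loss is quadratic in the prediction. For other losses the two-term expression would only approximate the augmented objective, with the variance term weighted by the second derivative of the loss.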

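This paper proposes a theoretical framework for understanding data augmentation techniques, analyzing them from two directions: Markov processes and kernel classifiers. It finds that data augmentation can be approximated by first-order feature averaging and second-order variance regularization components. The paper also applies the theory to accelerating machine learning workflows, showing practical value in predicting the utility of a transformation and in reducing the computation needed to train on augmented data.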

A Kernel Theory of Modern Data Augmentation