Data augmentation (DA) forms the cornerstone of many modern machine learning training pipelines, yet the mechanisms by which it works are not well understood. Much of the research on DA has focused on improving existing techniques, examining its regularization effects in the context of neural network overfitting, or investigating its impact on features. Here, we undertake a holistic examination of the effect of DA on three classifiers that are commonly used in supervised classification of imbalanced data: convolutional neural networks, support vector machines, and logistic regression models. We support our examination with experiments on three image and five tabular datasets. Our research indicates that DA, when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection, even though it may yield only relatively modest changes in global metrics such as balanced accuracy or F1 measure. We hypothesize that DA works by promoting variance in the data, so that machine learning models can associate changes in the data with labels. By diversifying the range of feature amplitudes that a model must recognize to predict a label, DA improves a model's capacity to generalize when learning from imbalanced data.
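To make the hypothesis concrete, the following minimal sketch shows how one simple form of DA can widen the range of feature amplitudes seen for a minority class. It is an illustration under stated assumptions, not the procedure used in this work: the jitter-based oversampling, the helper names `augment_minority` and `feature_ranges`, the `noise_scale` parameter, and the toy data are all hypothetical.

```python
import random


def augment_minority(samples, target_count, noise_scale=0.1, seed=0):
    """Oversample a minority class (hypothetical helper).

    Keeps the original samples, then appends jittered copies of randomly
    chosen originals until `target_count` samples exist. Gaussian jitter
    is an assumed stand-in for whatever DA method is actually used.
    """
    rng = random.Random(seed)
    augmented = [list(s) for s in samples]
    while len(augmented) < target_count:
        base = rng.choice(samples)
        augmented.append([x + rng.gauss(0.0, noise_scale) for x in base])
    return augmented


def feature_ranges(samples):
    """Return the (min, max) of each feature column."""
    return [(min(col), max(col)) for col in zip(*samples)]


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
balanced = augment_minority(minority, target_count=30)

print("original ranges: ", feature_ranges(minority))
print("augmented ranges:", feature_ranges(balanced))
```

Because the original samples are retained, the augmented per-feature ranges can only stay the same or grow. A classifier trained on `balanced` therefore has to associate a broader span of feature values with the minority label, which is the effect the hypothesis describes.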

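This study experimentally examines the effect of data augmentation on neural networks, support vector machines, and logistic regression models, and finds that it helps models generalize better, with a notable benefit in imbalanced classification problems. One mechanism is that, by promoting variation in the data, it lets machine learning models associate changes in the data with labels, which improves their ability to generalize.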

Exploring the role of data augmentation in imbalanced data