Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting almost-perfect performance. However, they assemble ground truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influences performance, and how well static and dynamic features complement each other. This work sheds light on those open questions by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification on it. Our results reveal that static features perform better than dynamic features, and that combining both provides only a marginal improvement over static features alone. We find no correlation between packing and classification accuracy, and we find that behaviors missing from dynamically extracted features heavily penalize their performance. We also demonstrate that a larger number of families to classify makes classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family generalize better to unseen data.

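This work investigates the key factors influencing ML-based malware detection and classification. It finds that static features outperform dynamic features, and that combining the two only slightly improves on static features alone. There is no correlation between packing and classification accuracy, while behaviors missing from dynamically extracted features heavily penalize their performance. A larger number of families to classify makes classification harder, while more samples per family increases accuracy. Finally, models trained on a uniform distribution of samples per family generalize better to unseen data.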
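
To make the notion of a balanced dataset concrete (a fixed number of samples per family, 100 in the paper), the following is a minimal sketch of per-family downsampling. It is an illustration only, not the authors' pipeline: the function name `balance_by_family`, the `(sample_id, family)` input format, and the choice to drop families with too few samples are all assumptions made for this example.

```python
import random
from collections import defaultdict


def balance_by_family(samples, per_family, seed=0):
    """Return a list with exactly `per_family` samples for each kept family.

    `samples` is an iterable of (sample_id, family) pairs (hypothetical format).
    Families with fewer than `per_family` samples are dropped (an assumption).
    """
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = []
    for family in sorted(by_family):
        ids = by_family[family]
        if len(ids) < per_family:
            continue  # not enough samples to reach the target count
        # Draw `per_family` distinct samples uniformly at random.
        for sample_id in rng.sample(ids, per_family):
            balanced.append((sample_id, family))
    return balanced
```

Calling `balance_by_family(samples, per_family=100)` on a labeled collection would yield a uniform distribution of samples per family, which is the kind of training distribution the abstract reports as generalizing better to unseen data.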

Decoding Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance