The use of machine learning algorithms in healthcare can amplify social injustices and health inequities. While biases can arise and compound during problem selection, data collection, and outcome definition, this research addresses generalizability impediments that occur during the development and post-deployment of machine learning classification algorithms. Using the Framingham coronary heart disease data as a case study, we show how to effectively select a probability cutoff to convert a regression model for a dichotomous variable into a classifier. We then compare the sampling distribution of the predictive performance of eight machine learning classification algorithms under four training/testing scenarios to test their generalizability and their potential to perpetuate biases. We show that both Extreme Gradient Boosting and Support Vector Machine are flawed when trained on an unbalanced dataset. We introduce the double discriminant scoring of type I and show that it is the most generalizable, as it consistently outperforms the other classification algorithms regardless of the training/testing scenario. Finally, we introduce a methodology to extract an optimal variable hierarchy for a classification algorithm, and illustrate it on the overall, male, and female Framingham coronary heart disease data.
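As an illustration of what selecting a probability cutoff can involve, the following minimal sketch picks the cutoff that maximizes Youden's J statistic (sensitivity + specificity - 1). This criterion is an assumption made here for illustration; the abstract does not state which selection rule the study uses. The function name `youden_cutoff` and the small set of probabilities and labels are hypothetical and are not taken from the Framingham data.

```python
from typing import Sequence, Tuple


def youden_cutoff(probs: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    A case is classified positive when its predicted probability >= cutoff.
    Assumption: Youden's J is used only as an example criterion.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_cutoff, best_j = 1.0, -1.0
    # Only the observed probabilities need to be tried as candidate cutoffs.
    for cutoff in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical predicted probabilities from a fitted regression model for a
# dichotomous outcome, with an unbalanced outcome (3 events out of 10).
probs = [0.05, 0.10, 0.12, 0.18, 0.20, 0.22, 0.30, 0.35, 0.40, 0.60]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

cutoff, j = youden_cutoff(probs, labels)
# Convert probabilities into class predictions using the selected cutoff.
predicted = [int(p >= cutoff) for p in probs]
```

In this toy example the default cutoff of 0.5 would flag only one of the three events, which hints at why the choice of cutoff matters when the outcome is unbalanced.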

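Using machine learning algorithms in healthcare may amplify social injustice and health inequity. This study focuses on some generalizability obstacles encountered during the development and use of machine learning classification algorithms. Taking the Framingham coronary heart disease data as a case study, it shows how to select a probability threshold to convert a regression model into a classifier, and compares the predictive performance of eight commonly used machine learning classification algorithms under different training/testing scenarios to test their generalizability and their potential to introduce bias. The results show that XGBoost and Support Vector Machine are flawed when trained on unbalanced datasets, whereas the double discriminant scoring of type I is the most generalizable, consistently outperforming the other classification algorithms across training/testing scenarios. Finally, a method for extracting an optimal variable hierarchy for a classification algorithm is proposed and illustrated on the overall, male, and female Framingham heart disease data.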

Comparison of Machine Learning Classification Algorithms and Their Application to the Framingham Heart Study