Diabetes Risk Prediction Model Using Machine Learning

Boyi Yang

doi:10.62051/nzr6tw29


Keywords:

Diabetes prediction; Machine learning; Receiver Operating Characteristic (ROC); Area under the ROC curve (AUC); Accuracy.

Abstract

Diabetes is a major global health challenge, contributing to increased mortality and long-term complications worldwide. Early diagnosis and effective risk stratification are critical to reducing the disease burden. This research evaluates and compares the predictive performance of five machine learning (ML) models, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), and Support Vector Machine (SVM), on the Pima Indians Diabetes Dataset. A standardized experimental workflow comprising data preprocessing, missing value imputation, feature scaling, and model training was applied. Accuracy, precision, recall, F1 score, and area under the ROC curve (AUC) were used to evaluate the models. Among the models tested, Gradient Boosting achieved the highest accuracy (75.97%), whereas Random Forest attained the highest AUC (0.833), indicating the strongest ability to separate the two classes across decision thresholds. These results suggest that the Random Forest model offers a promising and practical approach for implementing robust diabetes risk prediction tools in clinical or public health contexts.
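The abstract reports that one model had the highest accuracy while another had the highest AUC. The following minimal sketch illustrates how the five reported metrics are computed and why the two rankings can differ. It is not the author's code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration and are unrelated to the Pima Indians Diabetes Dataset or to the results above.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def threshold(scores, cutoff=0.5):
    """Turn predicted scores into hard 0/1 labels (cutoff is an assumption)."""
    return [1 if s >= cutoff else 0 for s in scores]


# Toy data, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.2, 0.3, 0.3, 0.1, 0.1, 0.1]    # hypothetical model A
scores_b = [0.9, 0.45, 0.45, 0.4, 0.3, 0.2, 0.1, 0.1]  # hypothetical model B

metrics_a = classification_metrics(y_true, threshold(scores_a))
metrics_b = classification_metrics(y_true, threshold(scores_b))
auc_a = roc_auc(y_true, scores_a)
auc_b = roc_auc(y_true, scores_b)

for name, m, auc in [("A", metrics_a, auc_a), ("B", metrics_b, auc_b)]:
    print(name, {k: round(v, 3) for k, v in m.items()}, "AUC", round(auc, 3))
```

On this toy data, model A is more accurate at the 0.5 cutoff, while model B ranks every positive above every negative and so has the higher AUC. Accuracy depends on a single threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two metrics can favor different models.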

