In this study, six supervised ML algorithms were employed to build a model that predicts diabetes at an early stage from socio-demographic characteristics. Before applying the machine learning techniques, exploratory data analysis (EDA) was performed to uncover hidden patterns in the dataset. ML techniques were then applied to build a predictive model for diabetes and to identify the most significant socio-demographic risk factors related to the disease. All findings of the study are presented in this section.
3.2. Performance Evaluation of ML Models
Six machine learning models (MLP, SVM, DT, LGBM, XGB, and RF) were applied, and their performances were compared to find the best-fit model for predicting diabetes at an early stage. The results of the ML models are presented in the following sections.
At first, the imbalanced dataset was trained using the train-test split method, in which 70% of the dataset was used to train the models and the remaining 30% was used to test them. The results of the train-test split method on the imbalanced dataset are presented in Table 2. According to Table 2, the lowest performance is produced by the SVM and MLP classifiers, while RF achieves the highest accuracy, 98.44%, among the six ML algorithms. RF also yields the highest recall, f1-score, specificity, kappa statistics, and MCC (0.9899, 0.9849, 0.9899, 0.9687, and 0.9687, respectively), whereas DT gives the highest precision (0.9896) and sensitivity (0.9892).
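The 70/30 split described above can be sketched with scikit-learn. This is a minimal illustration, not the study's actual pipeline: synthetic data stands in for the socio-demographic dataset, mimicking its size (520 instances) with an assumed class imbalance.

```python
# Minimal sketch of a stratified 70/30 train-test split with scikit-learn.
# Synthetic data stands in for the 520-row socio-demographic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 520 instances with an assumed 40/60 class ratio to mimic imbalance.
X, y = make_classification(n_samples=520, n_features=16,
                           weights=[0.4, 0.6], random_state=42)

# 70% train / 30% test; stratify preserves the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
acc = model.score(X_test, y_test)
```

Stratification matters on imbalanced data: without it, a random split can leave the test set with a noticeably different class ratio than the training set.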
Table 2. Performance evaluation on the imbalanced dataset for the train-test split method.

| Algorithm | Accuracy | Precision | Recall | F1-Score | Sensitivity | Specificity | Kappa Statistics | MCC |
|---|---|---|---|---|---|---|---|---|
| SVM | 92.19% | 0.9117 | 0.9394 | 0.9254 | 0.9032 | 0.9394 | 0.8434 | 0.8438 |
| MLP | 93.23% | 0.9388 | 0.9293 | 0.9340 | 0.9355 | 0.9293 | 0.8645 | 0.8645 |
| LGBM | 94.27% | 0.9782 | 0.9091 | 0.9424 | 0.9785 | 0.9091 | 0.8855 | 0.8879 |
| XGB | 96.35% | 0.9791 | 0.9495 | 0.9641 | 0.9785 | 0.9495 | 0.9271 | 0.9275 |
| DT | 97.39% | 0.9896 | 0.9596 | 0.9743 | 0.9892 | 0.9596 | 0.9479 | 0.9484 |
| RF | 98.44% | 0.9800 | 0.9899 | 0.9849 | 0.9785 | 0.9899 | 0.9687 | 0.9687 |
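The metrics reported in Tables 2–5 can all be derived from a binary confusion matrix. The sketch below shows how, using scikit-learn; the `y_true`/`y_pred` vectors are illustrative examples, not the study's predictions.

```python
# How the reported metrics follow from a confusion matrix (illustrative data).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, matthews_corrcoef,
                             confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # equals recall of the positive class
specificity = tn / (tn + fp)

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "kappa":     cohen_kappa_score(y_true, y_pred),
    "mcc":       matthews_corrcoef(y_true, y_pred),
}
```

Note that sensitivity and recall coincide for the positive class, which is why the two columns track each other in the tables above, while specificity is the recall of the negative class.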
Table 3 shows the results of the train-test split method on the balanced dataset. Again, RF achieves the highest accuracy, 99.37%, among the six ML algorithms, and it also attains the maximum scores on the remaining performance metrics: precision, recall, f1-score, sensitivity, specificity, kappa statistics, and MCC of 1.0000, 0.9902, 0.9951, 1.0000, 0.9902, 0.9859, and 0.9860, respectively.
Table 3. Performance evaluation on the balanced dataset for the train-test split method.

| Algorithm | Accuracy | Precision | Recall | F1-Score | Sensitivity | Specificity | Kappa Statistics | MCC |
|---|---|---|---|---|---|---|---|---|
| MLP | 93.59% | 0.9423 | 0.9608 | 0.9514 | 0.8889 | 0.9608 | 0.8571 | 0.8575 |
| SVM | 93.59% | 0.9510 | 0.9510 | 0.9510 | 0.9074 | 0.9510 | 0.8584 | 0.8584 |
| DT | 94.87% | 0.9796 | 0.9412 | 0.9600 | 0.9629 | 0.9412 | 0.8886 | 0.8900 |
| LGBM | 98.08% | 1.0000 | 0.9706 | 0.9851 | 1.0000 | 0.9706 | 0.9580 | 0.9589 |
| XGB | 98.72% | 1.0000 | 0.9804 | 0.9901 | 1.0000 | 0.9804 | 0.9719 | 0.9723 |
| RF | 99.37% | 1.0000 | 0.9902 | 0.9951 | 1.0000 | 0.9902 | 0.9859 | 0.9860 |
Table 4 presents the 5-fold cross-validation (CV) results of the ML approaches on the balanced dataset. The maximum CV accuracy, 94.87%, is obtained by the RF classifier. DT shows the highest precision (0.9784) and the highest sensitivity (0.9629), while RF gives the highest recall and f1-score, both 0.9608. The maximum specificity, kappa statistics, and MCC, all achieved by RF, are 0.9608, 0.8867, and 0.8867, respectively.
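The 5-fold CV procedure behind Tables 4 and 5 can be sketched as follows. Synthetic data again stands in for the real dataset, and the scoring shown is accuracy only; the exact fold configuration used in the study is an assumption here.

```python
# Minimal sketch of 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=520, n_features=16, random_state=0)

# StratifiedKFold keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
mean_acc = scores.mean()
```

Each of the five scores comes from training on four folds and testing on the held-out fifth, so the mean is a less split-dependent estimate than a single train-test split.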
Table 4. Performance evaluation on the balanced dataset for the 5-fold CV method.

| Algorithm | Accuracy | Precision | Recall | F1-Score | Sensitivity | Specificity | Kappa Statistics | MCC |
|---|---|---|---|---|---|---|---|---|
| DT | 91.61% | 0.9784 | 0.8921 | 0.9333 | 0.9629 | 0.8921 | 0.8228 | 0.8291 |
| RF | 94.87% | 0.9608 | 0.9608 | 0.9608 | 0.9259 | 0.9608 | 0.8867 | 0.8867 |
| SVM | 89.74% | 0.8981 | 0.9509 | 0.9238 | 0.7963 | 0.9509 | 0.7673 | 0.7703 |
| XGB | 92.95% | 0.9691 | 0.9216 | 0.9447 | 0.9444 | 0.9216 | 0.8475 | 0.8496 |
| LGBM | 92.95% | 0.9691 | 0.9216 | 0.9447 | 0.9444 | 0.9216 | 0.8475 | 0.8496 |
| MLP | 92.95% | 0.9505 | 0.9412 | 0.9458 | 0.9074 | 0.9412 | 0.8449 | 0.8450 |
The 5-fold CV results of the ML approaches on the imbalanced dataset are described in Table 5. According to Table 5, RF and LGBM achieve the highest accuracy, 93.23%, as well as the highest precision, 0.9574. The highest recall and specificity value, 0.9091, is shared by the DT, RF, XGB, and LGBM classifiers. RF and LGBM also show the maximum sensitivity score of 0.9570 and the maximum f1-score, kappa statistics, and MCC of 0.9326, 0.8647, and 0.8658, respectively.
Table 5. Performance evaluation on the imbalanced dataset for the 5-fold CV method.

| Algorithm | Accuracy | Precision | Recall | F1-Score | Sensitivity | Specificity | Kappa Statistics | MCC |
|---|---|---|---|---|---|---|---|---|
| DT | 91.67% | 0.9278 | 0.9091 | 0.9184 | 0.9247 | 0.9091 | 0.8333 | 0.8334 |
| RF | 93.23% | 0.9574 | 0.9091 | 0.9326 | 0.9570 | 0.9091 | 0.8647 | 0.8658 |
| SVM | 87.50% | 0.8947 | 0.8586 | 0.8763 | 0.8925 | 0.8586 | 0.7501 | 0.7507 |
| XGB | 92.71% | 0.9474 | 0.9091 | 0.9278 | 0.9462 | 0.9091 | 0.8542 | 0.8549 |
| LGBM | 93.23% | 0.9574 | 0.9091 | 0.9326 | 0.9570 | 0.9091 | 0.8647 | 0.8658 |
| MLP | 91.14% | 0.9271 | 0.8989 | 0.9128 | 0.9247 | 0.8989 | 0.8229 | 0.8233 |
Figure 4 shows the ROC curve and Precision-Recall (PR) curve for the six ML techniques applied in this study. The results for the balanced dataset are shown in Figure 4(A) and 4(B), while Figure 4(C) and 4(D) show the results for the imbalanced dataset. For the balanced dataset, the highest AUC score is 1.00, achieved by RF, XGBoost, and LGBM, which also show the highest AUCPR value of 1.00. For the imbalanced dataset, RF yields both the highest AUC score, 0.999, and the highest AUCPR value, 0.999.
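The AUC and AUCPR values plotted in Figure 4 are computed from the classifiers' predicted probabilities rather than their hard labels. A minimal sketch with scikit-learn, again on synthetic stand-in data:

```python
# Computing ROC-AUC and PR-AUC (average precision) from predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=520, n_features=16, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          stratify=y, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]        # scores for the positive class

auc_roc = roc_auc_score(y_te, proba)           # area under the ROC curve
auc_pr = average_precision_score(y_te, proba)  # area under the PR curve
```

On imbalanced data the PR curve is often the more informative of the two, since it is insensitive to the large number of true negatives that can inflate ROC-AUC.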
3.5. Discussion
Researchers have conducted a great deal of research on diabetes prediction, but there is still room for improvement. In this work, we employ a socio-demographic diabetes dataset to predict diabetes. After collecting the dataset, we preprocessed it to make it suitable for further analysis. We applied six supervised ML algorithms (DT, RF, SVM, XGBoost, LGBM, and MLP) to predict diabetes and assessed their results using different performance metrics: accuracy, precision, recall, f1-score, sensitivity, specificity, kappa statistics, and MCC. Among the applied ML algorithms, RF shows the best result for the train-test split technique, with 99.37% accuracy, 1.0000 precision, 0.9902 recall, 0.9951 f1-score, 1.0000 sensitivity, 0.9902 specificity, and kappa statistics and MCC of 0.9859 and 0.9860, respectively, and thus effectively predicts diabetes. The same socio-demographic diabetes dataset has also been analyzed by Islam, M. M., et al. (2020), who reported a best accuracy of 99.00% with the RF approach [4]. Ahmed, Usama, et al. (2022) likewise used the same dataset and obtained 94.87% accuracy, 0.9552 sensitivity, 0.9438 specificity, and an f1-score of 0.9412 [9]. The impact of features on the model plays an essential role in ML-based disease prediction. Therefore, in this work we also show the feature impact on the predictions of the six ML algorithms using a SHAP summary plot, which is presented graphically in Figure 8.
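A SHAP summary plot itself requires the `shap` package; as a lighter stand-in that conveys the same idea of per-feature impact, the sketch below ranks features by permutation importance with scikit-learn. The data and the choice of permutation importance (rather than SHAP values) are illustrative assumptions, not the study's method.

```python
# Ranking features by permutation importance as a stand-in for SHAP impact.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=520, n_features=8, random_state=2)
model = RandomForestClassifier(random_state=2).fit(X, y)

# Shuffle each feature in turn and measure the drop in score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=2)
ranking = np.argsort(result.importances_mean)[::-1]  # most impactful first
```

Like SHAP, this is model-agnostic and attributes predictive power to individual features, but it summarizes global importance only, whereas a SHAP summary plot also shows the direction of each feature's effect per instance.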
However, this work has some limitations. First, the dataset contains only 520 instances, which is sufficient to build an ML-based prediction model but far from ideal; more data should therefore be collected in the future. Second, the attributes of the dataset are exclusively socio-demographic, and socio-demographic data alone are not sufficient to predict diabetes accurately. For that reason, clinical data should be collected in the future and merged with the socio-demographic data to build a more effective diabetes prediction model. Future work should also explore more effective ML approaches and develop an end-user website for diabetes prediction.