I. Introduction
Diabetes mellitus (often referred to simply as diabetes) remains one of the most prevalent life-threatening diseases, affecting roughly 500 million people worldwide. It is a chronic disease associated with a high blood sugar level in the human body [1, 2]. The pancreas is an organ that produces a hormone known as insulin [3]. Insulin is released by the pancreas into the bloodstream, where it aids the transport of glucose into the cells [4]. Diabetes is a condition in which the pancreas is unable to make insulin or in which the body is unable to use insulin as it should. Because of its relatively long asymptomatic phase, its early detection has received considerable attention from both medical and non-medical scientists. Diabetes mellitus manifests in two main types: Type I and Type II [5]. The former occurs when the pancreatic beta cells are mistakenly attacked by the immune system and the body produces too little insulin, or none at all; in the latter, the body does not produce enough insulin or becomes resistant to it. A third, less common type is gestational diabetes, in which a woman becomes diabetic during pregnancy due to hormonal changes. Diabetes mellitus is known to exhibit symptoms such as polyuria, polydipsia, polyphagia, sudden weight loss (usually Type I [6]), weakness, obesity (usually Type II [7]), delayed healing, visual blurring, itching, irritability, genital thrush, partial paresis, muscle stiffness and alopecia.
The severity of this widespread disease is evident from the facts that 85% of diabetic patients are from low- and middle-income countries and that its clinical detection takes so long that patients may already have started suffering from diabetes-related complications such as heart attack, stroke, hypertension, blurry vision, blindness, foot ulcers, amputation, kidney damage and other organ failures [8, 9]. These complications set in because the disease has gone unnoticed or untreated for a number of years (7-12). In fact, the severity of its manifestation and associated complications correlates with how long its detection is delayed. Early diagnosis, early commencement of treatment and early awareness of a patient’s risk factors therefore contribute to reducing its prevalence globally, and benefit both the patient’s health and expenditure [10]. Identification of risk and protective factors is a key component in managing diseases that are incurable, easily confused with other conditions and slow to manifest [11]. Knowledge of these factors promotes awareness, helps prevent the disease, influences people’s lifestyles towards avoiding it, fosters effective prevention and suggests routines that serve as positive countermeasures.
Several statistics- and machine-learning-based studies continue to be conducted to predict and diagnose diabetes [12, 13, 14, 15]. The advent of technology has revolutionized many sectors, including healthcare and medical technologies. It helps improve the services offered to patients and serves as an efficient and effective means of treatment, diagnosis, service delivery, information handling, administration, etc. [16, 17]. In the recent past, studies have centered on the use of supervised deep learning and classical machine learning models for the prediction and determination of Type-II diabetes risk factors. In this study, however, we propose an unsupervised approach. A deep belief neural network (DBN) is proposed to determine the risk factors of Type-II diabetes, and the impact of ensemble feature selection is measured.
The rest of the paper is structured as follows. The next section discusses major works related to this study. The section after that describes the methodology, data and evaluation techniques of the proposed DBN model. The subsequent sections present and discuss the results obtained with the model; we then conclude the paper by highlighting the major findings of this research and charting future directions for the subject matter.
III. Methods
a. Dataset
The diabetes dataset used in this study was obtained from a publicly available online repository. It contains the responses obtained from 520 subjects (who recently became diabetic or are currently showing symptoms of diabetes) using a direct questionnaire. It was released by Sylhet Diabetes Hospital of Sylhet, Bangladesh. It consists of the age, sex, Boolean response to each diabetes-related question and the class to which each person belongs after medical diagnosis (Positive or Negative). There are 16 attributes for each subject under consideration, a summary of which is presented in Table 2.
The age distribution of the data is given in Figure 1. It shows that the ages in our data are normally distributed.
b. Model Development Workflow
The proposed model development comprises preprocessing, ensemble feature selection with final voting, DBN pretraining and fine-tuning by backpropagation for classification. Performance analysis is then used to measure the level of confidence that can be placed in the proposed model. These stages are represented diagrammatically in Figure 2.
vi. Preprocessing: This stage ensures that the diabetes dataset is well prepared for the machine learning task [42]. It safeguards the quality of the dataset through noise and duplicate removal, outlier detection and processing, and encoding of categorical and nominal variables into a numerical representation [43]. In the diabetes dataset, all Yes/No and Male/Female values were replaced with 1/0 respectively. The ages were encoded from 0 to 7 based on the categorization specified in Figure 1, and these values were normalized using Min-Max normalization [44] to prevent the age column from outweighing other columns during prediction, thereby reducing bias. The output of this stage is a dataset ready for further analyses and model pre-training.
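The encoding and normalization steps described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the column names, the toy rows, the Positive/Negative mapping and the age bin edges are all assumptions, since the actual age categories come from Figure 1.

```python
import pandas as pd

# Hypothetical toy rows; the column names are assumptions, not the dataset's exact headers.
raw = pd.DataFrame({
    "age": [25, 48, 63, 90],
    "sex": ["Male", "Female", "Female", "Male"],
    "polyuria": ["Yes", "No", "Yes", "No"],
    "class": ["Positive", "Negative", "Positive", "Negative"],
})

# Replace Yes/No and Male/Female with 1/0 (Positive/Negative is mapped the same way
# here as an assumption).
binary_map = {"Yes": 1, "No": 0, "Male": 1, "Female": 0, "Positive": 1, "Negative": 0}
encoded = raw.copy()
for col in ["sex", "polyuria", "class"]:
    encoded[col] = encoded[col].map(binary_map)

# Bin ages into eight ordinal groups coded 0-7. These bin edges are placeholders.
age_edges = [0, 20, 30, 40, 50, 60, 70, 80, 120]
age_codes = pd.cut(raw["age"], bins=age_edges, labels=False)


def min_max(series):
    """Min-Max normalization: x' = (x - min) / (max - min)."""
    return (series - series.min()) / (series.max() - series.min())


encoded["age"] = min_max(age_codes)
print(encoded)
```

After this step every column lies in the range 0 to 1, so the age codes no longer dominate the binary symptom columns by scale alone.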
vii. Ensemble Feature Selection: This study uses an ensemble dimensionality reduction framework to select the best feature set for the developed deep learning model while removing redundant features from the dataset. This helps to avoid both overfitting and underfitting, and to reduce the curse of dimensionality and model complexity [45, 46]. The ensemble selection leverages the individual strengths of each candidate feature selection method to find the best feature vectors for the deep learning models. The output of this stage is a “projection”, or subset, of the original dataset.
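A voting scheme of this kind can be sketched as below, using the three rankers the paper names later (Chi-Square, Mutual Information Gain and Variance Threshold). The voting rule itself is an assumption: here each ranker votes for its top `k` features and a feature is kept when it receives at least `min_votes` votes. The helper names, the synthetic data and the use of raw variance as a ranking score (rather than a fixed threshold) are likewise illustrative.

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif


def top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    return set(np.argsort(scores)[::-1][:k].tolist())


def vote_features(X, y, k, min_votes=2, seed=0):
    """Let three rankers vote for their top-k features; keep features with enough votes."""
    chi_scores, _ = chi2(X, y)  # Chi-Square statistic per feature
    mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=seed)
    var_scores = X.var(axis=0)  # the quantity a variance threshold is applied to
    votes = np.zeros(X.shape[1], dtype=int)
    for scores in (chi_scores, mi_scores, var_scores):
        for j in top_k(scores, k):
            votes[j] += 1
    return votes, np.flatnonzero(votes >= min_votes)


# Synthetic binary data: the label copies feature 0 with 10% noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6))
flip = rng.random(400) < 0.1
y = np.where(flip, 1 - X[:, 0], X[:, 0])

votes, selected = vote_features(X, y, k=2)
print("votes per feature:", votes)
print("selected feature indices:", selected)
```

Raising `min_votes` to 3 keeps only features on which all three rankers agree, which is one possible reading of a "strongly qualified" feature.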
viii. Building, Pretraining and Finetuning the DBN Model: This step comprises the actual stacking of Restricted Boltzmann Machines (RBMs) [47] to form a deep network, followed by training. A DBN is a generative, multi-layered graphical model. Pre-training is the unsupervised phase in which each of the deep (hidden) layers is trained as an RBM. The first stage of training a DBN is to train the layers sequentially, starting from the bottom visible (observed) layer of features. This input layer contains D units, where D is the input sample dimension. The input layer is fully connected to the first hidden layer, and each hidden layer consists of N units. The second layer is then trained on the output of the first, and the remaining hidden layers are learned in the same way until the final hidden layer is reached. The output layer consists of one unit, which defines the class. The final phase, called fine-tuning, adjusts the pre-trained network by backpropagation for classification. Figure 3 outlines the architecture of model pre-training proposed for our study.
The features in the input layer are the output of the voting ensemble feature selection procedure, comprising nine of the sixteen possible features. There are three hidden layers in our DBN model. The output layer is the class into which each instance in our dataset is classified (Positive/Negative).
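To make the layer-wise pre-training concrete, the sketch below trains a stack of binary RBMs with one-step contrastive divergence (CD-1) in NumPy. It is an illustration under assumptions, not the authors' implementation: it uses full-batch updates, sigmoid units throughout, small hypothetical layer sizes and random toy data with nine input features.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(V, n_hidden, epochs=20, lr=0.06, seed=0):
    """Train one binary RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_samples, n_visible = V.shape
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data, then a binary sample.
        h_prob = sigmoid(V @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visible layer, then recompute the hiddens.
        v_recon = sigmoid(h_sample @ W.T + b_v)
        h_recon = sigmoid(v_recon @ W + b_h)
        # CD-1 gradient estimate, averaged over all samples (full batch).
        W += lr * (V.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_v += lr * (V - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_h


def pretrain_stack(X, layer_sizes, **kwargs):
    """Greedy layer-wise pre-training: each RBM is trained on the previous layer's output."""
    layers, data = [], X
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(data, n_hidden, **kwargs)
        layers.append((W, b_h))
        data = sigmoid(data @ W + b_h)  # hidden activations feed the next RBM
    return layers, data


# Toy input: 50 samples with D = 9 binary features and three small hidden layers.
X_toy = np.random.default_rng(1).integers(0, 2, size=(50, 9)).astype(float)
layers, top_features = pretrain_stack(X_toy, layer_sizes=[8, 8, 4], epochs=5)
print([W.shape for W, _ in layers], top_features.shape)
```

The weights returned by `pretrain_stack` would then serve as the initialization for a supervised network that is fine-tuned by backpropagation.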
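ix. Performance Analysis: Our proposed DBN model for diabetes risk prediction was assessed using F1-Measure, Precision and Recall, where

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{F1-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

and TP is True Positive, FP is False Positive and FN is False Negative, all obtained from the confusion matrix of the result.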
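As a small worked example, the three metrics can be computed directly from the confusion-matrix counts. The function name and the counts used below are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1-Measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The guards against a zero denominator return 0.0 when a model predicts no positives at all, which is a convention chosen for this sketch.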
IV. Results and Discussion
In this study, we developed a voting ensemble feature selection method consisting of the Chi-Square (CS), Mutual Information Gain (MIG) and Variance Threshold (VT) methods. The top ten feature sets were selected and prepared for DBN pretraining for the prediction of diabetes mellitus. The parameters were tuned to achieve the best accuracy obtainable. The top ten feature sets were also passed through five benchmark models (KNN, Linear SVM, Logistic Regression, Decision Trees and Random Forests) for performance comparison. We also performed correlation analysis by plotting the correlation matrix in order to determine, prior to modeling, whether there was any risk of overfitting. Because the diabetes dataset is categorical, Spearman rather than Pearson correlation was used [48]. The result of the voting ensemble feature selection process is given in Table 3. The ensemble voting screened out age, itching and obesity as possible early predictors of diabetes mellitus, with obesity ranked lowest by our three feature rankers. The features voted in by our stacked selectors are sex, polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, visual blurring, irritability, delayed healing, partial paresis, muscle stiffness and alopecia. These were then prepared for the pretraining of our DBN model.
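The Spearman correlation analysis mentioned above can be sketched with pandas as follows. The three columns, their values and the cutoff of 0.4 are hypothetical and serve only to show the mechanics; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical binary symptom columns for eight subjects.
df = pd.DataFrame({
    "polyuria":   [1, 1, 0, 0, 1, 0, 1, 0],
    "polydipsia": [1, 1, 0, 0, 1, 0, 0, 1],
    "itching":    [0, 1, 0, 1, 1, 0, 0, 1],
})

# Rank-based (Spearman) correlation matrix.
corr = df.corr(method="spearman")

# List feature pairs whose absolute correlation exceeds an illustrative cutoff.
threshold = 0.4
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(corr.round(2))
print(pairs)
```

Pairs flagged in this way are candidates for redundancy, which is the kind of signal a correlation matrix can provide before any model is trained.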
In tuning our model, we experimentally selected the values of our parameters and identified those that were best for our model and data. Tuning is a crucial stage for avoiding fitting problems. For instance, the number of hidden layers was chosen carefully because too few results in underfitting while too many results in overfitting. In this study, we used the Rectified Linear Unit (ReLU) as the hidden activation function, three hidden layers with 250, 250 and 500 hidden units, Sigmoid as the input activation function, 20 RBM epochs, a batch size of 100 and a global learning rate of 0.06.
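For orientation, the reported hyperparameters can be plugged into a DBN-like stack built from scikit-learn components. This is only an approximation under stated assumptions and not the model used in the study: scikit-learn's `BernoulliRBM` uses sigmoid hidden units rather than ReLU, and the logistic-regression head is trained on top of frozen RBM features instead of fine-tuning the whole network by backpropagation. The function name and the toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline


def build_dbn_like(hidden_units=(250, 250, 500), lr=0.06, epochs=20, batch=100, seed=0):
    """Stack RBMs (greedy, unsupervised) and put a logistic-regression classifier on top."""
    steps = [
        (
            f"rbm{i}",
            BernoulliRBM(
                n_components=n,
                learning_rate=lr,
                n_iter=epochs,
                batch_size=batch,
                random_state=seed,
            ),
        )
        for i, n in enumerate(hidden_units)
    ]
    steps.append(("clf", LogisticRegression(max_iter=1000)))
    return Pipeline(steps)


# Toy data standing in for nine selected binary features (values are synthetic).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 9)).astype(float)
y_demo = X_demo[:, 0].astype(int)

model = build_dbn_like()
model.fit(X_demo, y_demo)
preds = model.predict(X_demo)
print(preds[:10])
```

Each RBM in the pipeline is fitted on the transformed output of the previous one, which mirrors the greedy layer-wise pre-training described in the Methods section.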
In our experimental setup, the performance of the deep model was tested in three ways: with all features in the original dataset, with all qualified (including strongly qualified) features, and with the strongly qualified features only. Table 4 shows the results of our experiments with the DBN model compared with the classical classification models.
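A benchmark comparison of this kind could be set up as sketched below with the five classical models named earlier. The synthetic data, the default model settings and the use of 5-fold cross-validation are assumptions made for illustration; the paper does not state its evaluation protocol at this level of detail, and the numbers printed here are unrelated to Table 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for nine binary features; the label depends on the first two.
rng = np.random.default_rng(0)
X_bench = rng.integers(0, 2, size=(300, 9)).astype(float)
y_bench = (X_bench[:, 0] + X_bench[:, 1] + rng.random(300) > 1.5).astype(int)

models = {
    "KNN": KNeighborsClassifier(),
    "Linear SVM": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scoring = ("precision", "recall", "f1")
results = {}
for name, clf in models.items():
    # Average each metric over the five folds.
    cv = cross_validate(clf, X_bench, y_bench, cv=5, scoring=scoring)
    results[name] = {metric: float(cv[f"test_{metric}"].mean()) for metric in scoring}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Running the same loop once per feature subset (all features, all qualified features, strongly qualified features only) would give the three-way comparison described above.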
Results obtained in this study show that sex, polyuria, polydipsia, sudden weight loss, weakness, polyphagia, muscle stiffness and alopecia are the strongest indicators in the dataset, while age, itching and obesity are deemed by the voting ensemble model to have no significant contribution to the diabetes status of the patients. Also, our proposed DBN model performed best when tested with the strongly qualified features and worst when all sixteen features were used. Although deep models do not in all cases outperform models with one or no hidden layer, our study showed that the DBN model outperformed the classical classification models in terms of average F1-Measure, recall and precision. This study is significant in that the deep learning model developed in this work can assist medics and patients in creating awareness of the early predictors of diabetes mellitus. One rather surprising finding of this study is that, even though diabetes affects older people more, our feature rankers disqualified age as a possible risk factor for diabetes. Early detection of diabetes is advantageous in that it can help shape lifestyle, dietary and sleeping patterns. Studies have also shown that early and intensive intervention not only prevents beta-cell dysfunction but also informs on the potential associated cardiovascular risk factors before the blood glucose thresholds currently set for diagnosing Type II diabetes are reached. It has also been established in the literature that early combination treatment with metformin-vildagliptin provides relevant improvements in long-term glycaemic control and can positively affect the disease’s progression. Hence the importance of this study [49, 50, 51].
V. Conclusion
In this study, we proposed and successfully implemented a deep belief network model, a class of multilayer deep learning models, with three RBM layers as the hidden layers. The performance of the model (in terms of F-measure, recall and precision) was tested using data collected through direct questionnaires from the patients of Sylhet Diabetes Hospital in Sylhet, Bangladesh and approved by a doctor. The effectiveness of dimensionality reduction was also measured using a voting ensemble feature selection comprising Mutual Information Gain, Variance Threshold and Chi-Square. We also implemented five classical machine learning models to benchmark the performance of our model. The proposed model can be reconstructed and reoptimized for the prediction of other diseases using similar datasets.
References
- Preston, E.V., et al., Climate factors and gestational diabetes mellitus risk–a systematic review. Environmental Health, 2020. 19(1): p. 1-19.
- Wang, P., et al., Seasonality of gestational diabetes mellitus and maternal blood glucose levels: evidence from Taiwan. Medicine, 2020. 99(41).
- Boiko, M., R. Ovchinnikova, and A. Shabrina. THE ROLE OF HORMONES IN THE HUMAN BODY. in Чeлoвeк. Oбщecтвo. Kyльтypa. Coциaлизaция. 2019.
- Hauge-Evans, A.C., SUGAR, DOGS, COWS, AND INSULIN—THE STORY OF HOW DIABETES STOPPED BEING DEADLY. Frontiers for young minds, 2021. 9.
- Padhi, S., A.K. Nayak, and A. Behera, Type II diabetes mellitus: A review on recent drug based therapeutics. Biomedicine & Pharmacotherapy, 2020. 131: p. 110708.
- Eizirik, D.L., L. Pasquali, and M. Cnop, Pancreatic β-cells in type 1 and type 2 diabetes mellitus: different pathways to failure. Nature Reviews Endocrinology, 2020. 16(7): p. 349-362.
- Padhi, S., M. Dash, and A. Behera, Nanophytochemicals for the treatment of type II diabetes mellitus: a review. Environmental Chemistry Letters, 2021. 19(6): p. 4349-4373.
- Lee, K.W., et al., Neonatal outcomes and its association among gestational diabetes mellitus with and without depression, anxiety and stress symptoms in Malaysia: A cross-sectional study. Midwifery, 2020. 81: p. 102586.
- Yang, Q.-Q., et al., The association between diabetes complications, diabetes distress, and depressive symptoms in patients with type 2 diabetes mellitus. Clinical nursing research, 2021. 30(3): p. 293-301.
- Kopitar, L., et al., Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Scientific reports, 2020. 10(1): p. 1-12.
- Gadekallu, T.R., et al., Early detection of diabetic retinopathy using PCA-firefly based deep learning model. Electronics, 2020. 9(2): p. 274.
- Yang, H., et al., New perspective in diabetic neuropathy: from the periphery to the brain, a call for early detection, and precision medicine. Frontiers in endocrinology, 2020. 10: p. 929.
- Sungheetha, A. and R. Sharma, Design an early detection and classification for diabetic retinopathy by deep feature extraction based convolution neural network. Journal of Trends in Computer Science and Smart technology (TCSST), 2021. 3(02): p. 81-94.
- Tofte, N., et al., Early detection of diabetic kidney disease by urinary proteomics and subsequent intervention with spironolactone to delay progression (PRIORITY): a prospective observational study and embedded randomised placebo-controlled trial. The lancet Diabetes & endocrinology, 2020. 8(4): p. 301-312.
- Hasan, D.A., et al. Machine Learning-based Diabetic Retinopathy Early Detection and Classification Systems-A Survey. in 2021 1st Babylon International Conference on Information Technology and Science (BICITS). 2021. IEEE.
- Ben-Israel, D., et al., The impact of machine learning on patient care: a systematic review. Artificial Intelligence in Medicine, 2020. 103: p. 101785.
- Peiffer-Smadja, N., et al., Machine learning for clinical decision support in infectious diseases: a narrative review of current applications. Clinical Microbiology and Infection, 2020. 26(5): p. 584-595.
- Bernabe-Ortiz, A., et al., Diagnostic accuracy of the Finnish Diabetes Risk Score (FINDRISC) for undiagnosed T2DM in Peruvian population. Primary care diabetes, 2018. 12(6): p. 517-525.
- Boulton, A.J., et al., Comprehensive foot examination and risk assessment: a report of the task force of the foot care interest group of the American Diabetes Association, with endorsement by the American Association of Clinical Endocrinologists. Diabetes care, 2008. 31(8): p. 1679-1685.
- Gray, L., et al., Implementation of the automated Leicester Practice Risk Score in two diabetes prevention trials provides a high yield of people with abnormal glucose tolerance. Diabetologia, 2012. 55(12): p. 3238-3244.
- Coetzee, A., et al., The prevalence and risk factors for diabetes mellitus in healthcare workers at Tygerberg hospital, Cape Town, South Africa: a retrospective study. Journal of Endocrinology, Metabolism and Diabetes of South Africa, 2019. 24(3): p. 77-82.
- El_Jerjawi, N.S. and S.S. Abu-Naser, Diabetes prediction using artificial neural network. International Journal of Advanced Science and Technology, 2018. 121.
- NirmalaDevi, M., S.A. alias Balamurugan, and U. Swathi. An amalgam KNN to predict diabetes mellitus. in 2013 IEEE international conference on emerging trends in computing, communication and nanotechnology (ICECCN). 2013. IEEE.
- Alehegn, M., R.R. Joshi, and P. Mulay, Diabetes Analysis and Prediction Using Random Forest, KNN, Naïve Bayes And J48: An Ensemble Approach. Int. J. Sci. Technol. Res, 2019. 8(9): p. 1346-1354.
- Brown, G.C., et al., Quality of life associated with diabetes mellitus in an adult population. Journal of Diabetes and its Complications, 2000. 14(1): p. 18-24.
- Tabaei, B.P. and W.H. Herman, A multivariate logistic regression equation to screen for diabetes: development and validation. Diabetes Care, 2002. 25(11): p. 1999-2003.
- Parthiban, G., A. Rajesh, and S. Srivatsa, Diagnosis of heart disease for diabetic patients using naive bayes method. International Journal of Computer Applications, 2011. 24(3): p. 7-11.
- Xu, W., et al. Risk prediction of type II diabetes based on random forest model. in 2017 Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB). 2017. IEEE.
- Al Jarullah, A.A. Decision tree discovery for the diagnosis of type II diabetes. in 2011 International conference on innovations in information technology. 2011. IEEE.
- Kumari, V.A. and R. Chitra, Classification of diabetes disease using support vector machine. International Journal of Engineering Research and Applications, 2013. 3(2): p. 1797-1801.
- Miotto, R., et al., Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 2016. 6(1): p. 1-10.
- Pham, T., et al., Predicting healthcare trajectories from medical records: A deep learning approach. Journal of biomedical informatics, 2017. 69: p. 218-229.
- Tripathi, G. and R. Kumar. Early prediction of diabetes mellitus using machine learning. in 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions)(ICRITO). 2020. IEEE.
- Parte, R., et al., Non-invasive method for diabetes detection using CNN and SVM classifier. International journal of research in engineering, science and management, 2019. 2: p. 659-661.
- Swapna, G., K. Soman, and R. Vinayakumar, Diabetes detection using ecg signals: An overview. Deep Learning Techniques for Biomedical and Health Informatics, 2020: p. 299-327.
- Hu, J., et al., Raman spectrum classification based on transfer learning by a convolutional neural network: Application to pesticide detection. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 2022. 265: p. 120366.
- Al-Smadi, M., et al., A transfer learning with deep neural network approach for diabetic retinopathy classification. International Journal of Electrical and Computer Engineering, 2021. 11(4): p. 3492.
- Spänig, S., et al., The virtual doctor: an interactive clinical-decision-support system based on deep learning for non-invasive prediction of diabetes. Artificial intelligence in medicine, 2019. 100: p. 101706.
- Nguyen, B.P., et al., Predicting the onset of type 2 diabetes using wide and deep learning with electronic health records. Computer methods and programs in biomedicine, 2019. 182: p. 105055.
- Ryu, K.S., et al., A deep learning model for estimation of patients with undiagnosed diabetes. Applied Sciences, 2020. 10(1): p. 421.
- Prabhu, P. and S. Selvabharathi. Deep belief neural network model for prediction of diabetes mellitus. in 2019 3rd international conference on imaging, signal processing and communication (ICISPC). 2019. IEEE.
- Zelaya, C.V.G. Towards explaining the effects of data preprocessing on machine learning. in 2019 IEEE 35th international conference on data engineering (ICDE). 2019. IEEE.
- Deshmukh, D.H., T. Ghorpade, and P. Padiya. Improving classification using preprocessing and machine learning algorithms on NSL-KDD dataset. in 2015 International Conference on Communication, Information & Computing Technology (ICCICT). 2015. IEEE.
- Patro, S. and K.K. Sahu, Normalization: A preprocessing stage. arXiv preprint. arXiv:1503.06462, 2015.
- Saeys, Y., T. Abeel, and Y. Van de Peer. Robust feature selection using ensemble feature selection techniques. in Joint European conference on machine learning and knowledge discovery in databases. 2008. Springer.
- Seijo-Pardo, B., et al., Ensemble feature selection: homogeneous and heterogeneous approaches. Knowledge-Based Systems, 2017. 118: p. 124-139.
- Zhang, N., et al., An overview on restricted Boltzmann machines. Neurocomputing, 2018. 275: p. 1186-1199.
- Bonett, D.G. and T.A. Wright, Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika, 2000. 65(1): p. 23-28.
- Gómez-Peralta, F., et al., When does diabetes start? Early detection and intervention in type 2 diabetes mellitus. Revista Clínica Española (English Edition), 2020. 220(5): p. 305-314.
- Gilmer, T.P. and P.J. O'Connor, The growing importance of diabetes screening. 2010, Am Diabetes Assoc. p. 1695-1697.
- Sabariah, M.M.K., S.A. Hanifa, and M.S. Sa'adah. Early detection of type II Diabetes Mellitus with random forest and classification and regression tree (CART). in 2014 International Conference of Advanced Informatics: Concept, Theory and Application (ICAICTA). 2014. IEEE.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).