1. Introduction
Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder that affects millions of people worldwide, posing significant health and economic burdens. Early detection and prevention of T2DM are crucial for reducing its complications and improving the quality of life of patients [1]. Predicting diabetes allows for earlier detection and intervention, potentially delaying or preventing disease progression. This aligns with personalized medicine’s emphasis on proactive healthcare. However, the current diagnostic methods for T2DM, such as the oral glucose tolerance test (OGTT) and the glycated hemoglobin (A1C) test, can be invasive, costly, and time-consuming.
Phenotypic data can provide valuable insights into the risk factors and pathophysiology of T2DM. Phenotypic data include anthropometric measurements, biochemical markers, lifestyle habits, medical history, and family history. Machine learning (ML) techniques, which are computational methods that learn from data and make predictions, can leverage phenotypic data to build predictive models for T2DM [2]. ML techniques have several advantages over conventional statistical methods, such as the ability to handle high-dimensional and nonlinear data, discover complex patterns and interactions, and improve accuracy and generalization [3,4,5].
Several studies have applied different ML techniques to predict T2DM using phenotypic data from various populations. For example, Yu et al. (2010) used a support vector machine (SVM) to classify instances of diabetes using data from the National Health and Nutrition Examination Survey (NHANES) [6]. Deberneh and Kim utilized five ML algorithms for the prediction of T2DM using both laboratory results and phenotypic variables [2]. Anderson et al. used a reverse engineering and forward simulation (REFS) analytical platform that relies on a Bayesian scoring algorithm to create prediction-model ensembles for progression to prediabetes or T2DM in a large population [7]. Cahn et al. used electronic medical record (EMR) data from The Health Improvement Network (THIN) database, which represents the UK population, to identify prediabetic individuals [8]. Shin et al. used logistic regression (LR), decision tree, random forest (RF), eXtreme Gradient Boosting (XGBoost), Cox regression, and XGBoost Survival Embedding (XGBSE) for the prediction of diabetes [9]. Gul et al. investigated the predictive value of phenotypic variables (body mass index, cholesterol, familial diabetes history, and high blood pressure) with LR [10]. Dinh et al. achieved high area under the curve (AUC) scores with and without laboratory data using LR, SVM, and three ensemble models (RF, gradient boosting-XGBoost, and a weighted ensemble model) [11]. Viloria et al. used SVM to predict T2DM using only body mass index (BMI) and blood glucose concentration [12]. Wang et al. compared XGBoost, SVM, RF, and K-nearest neighbor (K-NN) algorithms to predict the risk of T2DM [13].
Previous studies have shown that ML techniques can be used to predict T2DM using phenotypic data. However, these studies have several limitations. First, they focused on a few ML algorithms or compared them in isolation without considering the full range of existing methods or novel algorithms. Second, some of these studies used laboratory results that are directly related to glucose metabolism and may not reflect other aspects of phenotypic variation. Third, most did not consider gender disparities in risk factors, disease outcomes, and model performance for T2DM. Finally, some studies relied on datasets that were either small or of questionable quality [13]. Despite these limitations, the previous studies provide promising evidence that ML techniques can be used to develop accurate and robust models for predicting T2DM risk.
This study aimed to address the limitations of previous studies by using PyCaret, an open-source, low-code ML library in Python that automates ML workflows [14]. PyCaret allows the simultaneous evaluation of multiple ML algorithms, including newly developed ones such as XGBoost, LightGBM, and CatBoost, for predicting T2DM. PyCaret eliminates the need for extensive domain expertise, reduces analysis time, and yields more comprehensive evaluation metrics.
Another aim of the research was to explore the differences between male and female populations in predicting T2DM. By analyzing how phenotype and gender interact, we can identify risk factors that are more prominent in one sex compared to the other. This knowledge allows healthcare providers to tailor personalized screening and prevention strategies for men and women.
2. Materials and Methods
2.1. Dataset Overview
Raw data from controls and patients were obtained from the Nurses’ Health Study (NHS), an all-female cohort, and the Health Professionals Follow-up Study (HPFS), an all-male cohort. Data are available at the database of Genotypes and Phenotypes (dbGaP) under accession phs000091.v2.p1 and were obtained with permission [15].
NHS and HPFS are well-established cohorts that are part of the Genes and Environment Initiatives (GENEVA). In addition to investigating the genetic factors contributing to the development of T2DM, they also aim to explore the role of environmental exposures. These cohorts offer a resource for studying the genetic and environmental factors associated with T2DM. Participants in both cohorts completed comprehensive mailed questionnaires regarding their medical history and lifestyle.
2.2. Variables
The variables used for analysis in this study include:
Medical history variables: family history of diabetes among first-degree relatives (famdb), reported high blood pressure (hbp), reported high blood cholesterol at/before blood draw (chol)
Intake variables: alcohol intake (alcohol)
Nutrient variables: heme iron intake (heme), magnesium intake (magn), cereal fiber intake (ceraf), polyunsaturated fat intake (pufa), trans fat intake (trans), glycemic load (gl)
Lifestyle variables: cigarette smoking (smk), exercise habits (total physical activity, act)
Body measurements: body mass index (BMI or bmi)
Gender (for total data)
Categorical variables were used as is, with no numeric conversion applied for analysis.
2.3. Data Preprocessing
The data contain information on the disease status of a total of 6,033 individuals: 3,429 females (NHS) and 2,604 males (HPFS). The analysis focused on white individuals. Participants of other races (158), Hispanic participants (37), those with other types of diabetes (133), individuals without a genotype ID (25), and first-degree relatives (8) of the participants were excluded. The characteristics of the remaining 5,672 individuals, which include both controls and individuals with T2DM, are presented in Table 1.
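For illustration, the exclusion steps can be written as a short pandas filter. This is a minimal sketch only: the file name and every column name used here (race, hispanic, db_type, genotype_id, first_degree_relative) are hypothetical and do not correspond to the actual dbGaP variable names.

```python
# Minimal sketch of the exclusion criteria (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("nhs_hpfs_phenotypes.csv")            # hypothetical input file

df = df[df["race"] == "white"]                          # keep white participants
df = df[df["hispanic"] != 1]                            # exclude Hispanic participants
df = df[~df["db_type"].isin(["type1", "other"])]        # exclude other types of diabetes
df = df[df["genotype_id"].notna()]                      # exclude records without a genotype ID
df = df[df["first_degree_relative"] != 1]               # exclude first-degree relatives

print(len(df))                                          # 5,672 individuals remain in the study
```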
2.4. PyCaret Analysis
PyCaret was used to analyze a dataset of 5,672 individuals with 14 phenotypic variables in the male and female datasets and 15 variables, including gender, in the total dataset. The dataset was divided into male and female subsets, and the performance and features of 16 ML classification algorithms were compared for each subset.
PyCaret analysis was performed in the following order: importing the necessary libraries and loading the data (as a .csv file), preprocessing the data with PyCaret, identifying the feature types (numeric or categorical), and handling missing data. Because missing values accounted for 3.5% of the total dataset, with a maximum rate of 1.5% in any individual feature, no data points were dropped; PyCaret used a simple imputation method to address the missing data.
Next, the classification command was used to compare the available classification algorithms. Once the results were available, the best algorithm was selected, and 10-fold cross-validation was performed to tune the model’s hyperparameters. The model’s performance and robustness were evaluated using stratified 10-fold cross-validation with a train-test split ratio of 70:30. The analysis was implemented using the Anaconda Navigator IDE with Python in the Jupyter Notebook (version 6.5.2) editor, along with PyCaret (version 3.0.0), on a 64-bit Windows 11 computer.
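As a rough sketch of this workflow (not the exact notebook used in the study), the PyCaret 3.0 classification steps can be written as follows; the input file name and the target column name t2dm are hypothetical assumptions:

```python
# Minimal sketch of the PyCaret 3.0 classification workflow described above.
import pandas as pd
from pycaret.classification import setup, compare_models, tune_model

df = pd.read_csv("female_subset.csv")        # hypothetical file; repeated for the male and total subsets

# 70:30 train-test split, stratified 10-fold CV, and simple imputation of missing values
clf = setup(
    data=df,
    target="t2dm",                           # hypothetical label column name
    train_size=0.7,
    fold=10,
    fold_strategy="stratifiedkfold",
    numeric_imputation="mean",
    categorical_imputation="mode",
    session_id=42,
)

best = compare_models()                      # trains and ranks the available classifiers
tuned = tune_model(best)                     # 10-fold CV hyperparameter tuning of the top model
```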
The analysis provided the following evaluation metrics: accuracy, area under the curve (AUC), recall, precision, F1-score, kappa score, Matthews correlation coefficient (MCC), and training time (TT). A variable importance plot was also produced, and a SHAP (SHapley Additive exPlanations) values plot was generated with subsequent commands.
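Continuing the sketch above (with tuned denoting the tuned model), the hold-out metrics, the variable importance plot, and the SHAP summary plot can be requested with PyCaret’s prediction and plotting commands:

```python
# Continuation of the sketch above; "tuned" is the model returned by tune_model().
from pycaret.classification import predict_model, plot_model, interpret_model

predict_model(tuned)                         # metrics on the 30% hold-out set (accuracy, AUC, recall, ...)
plot_model(tuned, plot="feature")            # variable importance plot
interpret_model(tuned, plot="summary")       # SHAP summary plot (tree-based models; requires the shap package)
```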
Accuracy is the percentage of correct predictions out of all predictions. Precision is the percentage of correct positive predictions out of all positive predictions. Recall is the percentage of correct positive predictions out of all actual positives. The F1-score is the harmonic mean of precision and recall, balancing both metrics. AUC is a measure of how well a model ranks positive and negative examples correctly. The kappa score is a measure of agreement between the model’s predictions and the actual labels. MCC is a balanced measure of performance that takes into account true positives, true negatives, false positives, and false negatives. In terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the formulas for these metrics are:
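\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN},
\]
\[
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
Here, \(p_o\) and \(p_e\) denote the observed and chance agreement, respectively. AUC has no closed-form expression of this kind; it is computed as the area under the receiver operating characteristic (ROC) curve.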
2.5. Statistical Analysis
SPSS software (version 28.0.0.0; SPSS Inc., Chicago, IL, USA) was used to construct the tables, and the Real Statistics Resource Pack for Excel (version 8.8.2) [16] was used to analyze the numerical data, as SPSS does not provide more than 36 decimals or exact values for very small p-values. A chi-square test was used to examine the association between categorical variables using the Chi-Square Test Calculator [17]. The independent samples t-test was used to compare the means of numerical variables between groups. Numerical variables were presented as means ± standard deviations (SD) and categorical variables as frequencies and percentages. The significance level was set at p < 0.05.
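For reference, equivalent tests can be run in Python with SciPy; the sketch below uses hypothetical counts and simulated samples rather than the values analyzed in this study:

```python
# Chi-square test and independent-samples t-test with SciPy (hypothetical data).
import numpy as np
from scipy import stats

# Chi-square test of independence, e.g. family history of diabetes (rows) by case/control status (columns)
table = np.array([[320, 180],
                  [410, 530]])               # hypothetical 2x2 contingency table
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Independent-samples t-test for a numerical variable, e.g. BMI in cases vs. controls
rng = np.random.default_rng(0)
bmi_cases = rng.normal(29.0, 4.5, 400)       # hypothetical samples
bmi_controls = rng.normal(25.5, 4.0, 400)
t_stat, p_t = stats.ttest_ind(bmi_cases, bmi_controls)

print(f"chi-square p = {p_chi:.3g}, t-test p = {p_t:.3g}")
```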
4. Discussion
ML algorithms, which are sets of rules that do not require explicit programming, allow computers to learn and make predictions from data. They are used for a variety of applications, such as natural language processing, image recognition, fraud detection, and disease prediction [19]. In the past, using the most prevalent ML algorithms simultaneously was infeasible because of the need for programming expertise and time-consuming testing. However, developments in information technologies have made it possible to run ML algorithms with little coding and to analyze many ML algorithms simultaneously.
Machine learning and statistical analysis were applied in the current study to investigate the factors associated with T2DM in the NHS and HPFS datasets. PyCaret classification analysis [14] was used as a machine learning tool to run 16 classification algorithms simultaneously with minimal coding to predict T2DM. The performance of the models was evaluated using various metrics, such as accuracy, AUC, recall, precision, F1-score, kappa, MCC, and TT (Sec). Feature importance plots and SHAP values were used to interpret the models and identify the most relevant features for the prediction.
The ridge classifier, linear discriminant analysis (LDA), and LR exhibited the best performance among the models for the male-only data subset, all achieving similar scores. Similarly, for the female-only data subset, LR, the ridge classifier, and LDA were the top-performing models, also with similar scores. In the total data subset, LR, the gradient boosting classifier (GBC), and the CatBoost classifier emerged as the best-performing models, demonstrating comparable scores.
The feature importance plot, one of the most commonly used explanation tools in machine learning analysis, was also utilized [20]. This tool shows how much each feature contributes to the prediction model, based on the change in accuracy or error when the feature is removed or shuffled [21]. The higher a feature’s importance value, the more the model relies on it. However, this tool does not imply any causal relationship between the features and the outcome, as there may be other factors or interactions that affect the model. A feature may have high statistical significance but low variable importance if it does not improve the prediction model much.
The feature importance plot aids in understanding which features are relevant to the prediction model and in identifying irrelevant or redundant ones. Additionally, it assists in selecting or eliminating features to enhance or simplify the model. However, it should be noted that the feature importance plot may vary depending on the type of machine learning technique, the dataset, and the evaluation metric used to measure the model’s performance. Therefore, it should be interpreted with caution and in conjunction with other methods of machine learning analysis. The feature importance plot showed that features had different importance values for the prediction model in the different data subsets. The most important features were “famdb”, “smoker_never”, and “hbp” in the female-only data subset, and “famdb”, “hbp”, and “smoker_current” in the male-only data subset. Furthermore, the order of the variables and their values differ across genders.
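The shuffle-based importance described here can be illustrated with scikit-learn’s permutation_importance; the sketch below uses a synthetic stand-in dataset and a gradient boosting classifier rather than the study’s actual data or tuned models:

```python
# Sketch of shuffle-based (permutation) feature importance on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phenotypic features (14 columns, as in the study's subsets)
X, y = make_classification(n_samples=1000, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Importance = mean drop in accuracy when a feature's values are shuffled on the hold-out set
result = permutation_importance(model, X_test, y_test,
                                scoring="accuracy", n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}")
```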
A SHAP value analysis of the variables’ association with T2DM was also performed using PyCaret. SHAP values show how each feature affects the model output and can be a useful visual tool for understanding how machine learning models work and for identifying the most important features for a given prediction task. The SHAP values plot showed the distribution of impacts that each feature had on the model output. The color represented the value of the feature, with red being high and blue being low. The higher the SHAP value (x-axis), the higher the likelihood of the positive class. The SHAP values plot also showed some interactions between features that affected the model output.
Dinh et al. used 123 variables for 1999-2014 and 168 variables for 2003-2014, including survey questionnaire items and laboratory results [11]. They found that the AUC for identifying prediabetic individuals was 73.7% without and 84.4% with the laboratory data. In another study, Lai et al. found that the GBM and logistic regression models performed better than the random forest and decision tree models. However, their models also relied on laboratory data, such as fasting blood glucose, with BMI, high-density lipoprotein (HDL), and triglycerides among the most important predictors [22]. The well-known Framingham Diabetes Risk Scoring Model (FDRSM) is a simple clinical model that uses eight factors (gender, age, fasting blood glucose, blood pressure, triglycerides, HDL, BMI, and parental history of diabetes) to predict the 8-year risk of developing diabetes with logistic regression models [23,24]. The complex clinical models also used the 2-hour post-OGTT glucose [24]. While the AUC was 0.72 in the simple model, it increased to 0.85 in the complex clinical model. Clearly, the use of blood glucose levels or the OGTT in model creation alone has a significant impact on model performance. However, it is crucial to establish the predictive effectiveness of variables prior to the onset of elevated blood glucose levels. An AUC of 0.79 was obtained in the ML analysis conducted on the total dataset in the current study. The relatively lower performance can be attributed to the use of fewer variables, especially phenotypic variables instead of direct glucose or OGTT measurements, and less laboratory data in the current models. Moreover, the current approach employed a broader range of ML algorithms than previous studies, enabling a comprehensive comparison and selection of the most effective methodologies. Utilizing PyCaret for the ML algorithms streamlined the process by automating the data processing and model evaluation steps. Therefore, the differences between this and other studies primarily stem from variations in the nature of the data.
An in-depth investigation into the predictive potential of phenotypic variables for T2DM was conducted. To complement the current ML approach, a rigorous statistical analysis was undertaken to assess the inferential strength of each individual variable. Statistical analysis tests hypotheses and infers causal relationships between individual variables and diabetes risk, while ML builds models and makes predictions based on the data, without necessarily explaining how the data are related or what causes the outcome [25]. However, conventional statistical methods often fail to capture the nonlinear and interactive effects of multiple variables on diabetes risk. In contrast, ML algorithms can uncover intricate patterns and interactions within the data that are not evident through conventional statistical measures alone. Therefore, the complementary strengths of both statistics and ML were leveraged to provide a more comprehensive understanding of the predictive factors for T2DM, allowing for the prioritization of variables that may have been overlooked by traditional statistical analysis. As Bennett et al. suggest, ML and statistical analysis are different but complementary methods that can provide different insights into the data, depending on the research question and the data available [25].
Furthermore, gender differences in diabetes risk and outcomes have been extensively reviewed in previous studies [26]. However, most of these investigations have primarily focused on the role of sex hormones, sex chromosomes, or sex-specific environmental factors in explaining these disparities. This study aimed to investigate whether phenotypic variables, including BMI, blood pressure, and lipid levels, demonstrate distinct predictive patterns for diabetes risk among males and females. Leveraging ML techniques, a substantial dataset comprising phenotypic variables was analyzed from individuals with and without T2DM. The current findings revealed that certain phenotypic variables displayed varying degrees of predictive power for diabetes risk across genders. Notably, variables such as famdb, hbp, and chol exhibited higher feature importance scores for females than for males, whereas heme emerged as a more significant predictor for men than for women when smoking status was disregarded in the modeling. Furthermore, the feature importance analysis also revealed gender as one of the significant factors in the overall dataset. These results suggest that phenotypic variables can capture certain facets of the sex-specific pathophysiology of diabetes and have the potential to enhance the accuracy and personalization of diabetes risk prediction models. Further studies are needed to validate the current findings and delve into the underlying mechanisms contributing to the sex differences observed in phenotype-based diabetes prediction.