1. Introduction
With the ongoing development of information and communication technologies, unprecedented amounts of data are being generated across multiple fields of healthcare [1]. Data-driven models offer great research opportunities, allowing us to extract clinical knowledge and support decision-making. Along with data proliferation, the use of Artificial Intelligence (AI) has intensified in recent years, fostered by advances in computer processing, software platforms, and automatic differentiation [2]. Within AI, Machine Learning (ML) methods have attracted significant attention in both academia and industry, being used in domains ranging from computer vision to natural language processing [3] and for tasks such as classification, regression, and clustering, among others.
Despite the potential of ML methods, most are hampered by the class imbalance problem, which occurs when the samples of one class greatly outnumber those of the other classes [4,5]. Since most ML algorithms are built to work with balanced datasets, the resulting classifiers are biased toward the majority class. To deal with this, several methods have been proposed in the literature [4,5,6,7], which can be classified into two types: algorithmic-level and data-level approaches. The former adapts the loss function of the algorithm by assigning a higher weight to the misclassification of samples from the minority classes during training [8]. Examples of these approaches include cost-sensitive learning and ensemble methods [8]. In contrast, data-level approaches balance the class distribution by undersampling the majority class, oversampling the minority classes, or using a hybrid approach that combines both [9].
In this paper, we primarily focus on oversampling techniques. Among them, the Synthetic Minority Oversampling Technique (SMOTE) [10] is one of the most widely used. SMOTE relies on a nearest-neighbor algorithm to generate new samples that lie between a minority sample and one of its near neighbors of the same class. SMOTE has been used to generate numerical data and improve the generalization of predictive models in tasks such as regression and classification [11,12,13]. However, many real-world applications present high-dimensional and heterogeneous (mixed-type) data with numerical and categorical features. SMOTEN, a variant of SMOTE for categorical data, has been used in various applications [10]; however, the quality of the synthetic data it generates is limited [14,15,16].
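A minimal sketch of the two generation rules, assuming linear interpolation for SMOTE and a nearest-neighbor majority vote for SMOTEN (function names are illustrative, not from any specific library):

```python
import random
from collections import Counter

def smote_sample(x, neighbor, rng=random):
    """SMOTE-style interpolation for numerical features: place a synthetic
    sample at a random point on the segment between a minority sample and
    one of its nearest neighbors."""
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

def smoten_sample(neighbors):
    """SMOTEN-style generation for categorical features: each feature of
    the synthetic sample is the most frequent category among the nearest
    neighbors (interpolation is meaningless for categories)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*neighbors)]
```

For example, `smoten_sample([["M", "OW"], ["M", "HW"], ["W", "OW"]])` returns `["M", "OW"]`, the per-feature majority among the three neighbors.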
Recently, generative models based on Artificial Neural Networks (ANNs) have revolutionized outcomes in multiple knowledge areas due to their outstanding performance in creating synthetic data, particularly in computer vision and image applications [17]. Despite these outcomes, only a few strategies have been studied for generating tabular (structured) data. For instance, a variant of the Variational Autoencoder (VAE) called Tabular VAE (TVAE) has been proposed, which uses two ANNs and trains with an evidence lower-bound loss to create mixed-type data [18]. Techniques based on Generative Adversarial Networks (GANs) have also emerged as a potential tool for creating synthetic data, frequently enhancing model performance in classification tasks while also addressing data privacy issues. Although the application of GANs has been validated in different domains [17], they have not been well studied for Electronic Health Records (EHRs) with structured data containing categorical and continuous features [19]. Because tabular data typically contain a mix of categorical and continuous features, generating realistic synthetic data is not an easy task. In this sense, Conditional Tabular GANs (CTGANs) have been created for modeling tabular data distributions and sampling entries from them [18] by employing a conditional generative adversarial network. Furthermore, on several real-world datasets, CTGAN has outperformed Bayesian approaches [18].
In healthcare applications, class imbalance is a recurrent challenge when building predictive models with reasonable generalization capacity, because a highly skewed training distribution tends to bias the learning algorithm toward the majority class. In particular, we focus our research on Cardiovascular Diseases (CVDs), since they are the leading cause of death worldwide [20]. Specifically, we analyze data collected by a smartphone-based method from a population group in Norway [21]. The dataset comprises a series of survey questions related to socioeconomic factors, alcohol and drug use, Physical Activity (PA), and dietary intake, plus one question indicating current/previous non-communicable diseases. Working with categorical features in ML is challenging because most algorithms work adequately only with numerical data, and one-hot encoding is one of the most popular approaches to transform categories into numbers [22]. However, this approach generally returns a sparse matrix, increasing the number of features (dimensions) the model handles and the risk of the curse of dimensionality [23]. This problem is amplified when a feature includes an excessive number of categories, most of which are irrelevant for the prediction. To cope with these issues, we applied a target encoding strategy [22]. The goal of this approach is to encode the categories by substituting them with a measurement of the impact they may have on the target.
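As an illustration, smoothed target encoding can be sketched as follows; the blending formula below is one common variant (category mean shrunk toward the global mean by a smoothing weight w), not necessarily the exact one used in this work, and the function name is illustrative:

```python
from collections import defaultdict

def target_encode(categories, targets, w=10.0):
    """Smoothed target encoding: replace each category with a blend of its
    in-category target mean and the global target mean, weighted by the
    smoothing parameter w (small w trusts the category mean more)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # encoded value = (sum_of_targets_in_category + w * global_mean) / (count + w)
    mapping = {c: (sums[c] + w * global_mean) / (counts[c] + w) for c in counts}
    return [mapping[c] for c in categories], mapping
```

With w=0 each category maps to its raw target mean; as w grows, rare categories are pulled toward the global mean, which mitigates the over-fitting risk mentioned above.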
This research aims to perform a comparative study of synthetic categorical data generation methods, with a special focus on GAN-based models. To this end, we have generated new samples with oversampling methods that seek to maintain the same feature categories as the original data, a similar probability distribution of attributes, and the dependence between them, thus addressing the problem of data imbalance. This enables improving the effectiveness and accuracy of the developed classifiers. However, some ML methods use nonlinear transformations, leading to a lack of interpretability and creating black-box models [24]. Interpretation is defined as the process of generating human-understandable explanations of the outcomes provided by computational models [24,25]. Several approaches have been proposed for gaining interpretability and improving model understandability and reliability, chiefly model-specific and model-agnostic methods [26]. The former is based on feature weighting, which seeks to identify the contributions of the features that determine the predictions of ML models. Although feature weighting is easy to apply to simple linear models, these models tend to have limited predictive performance and, therefore, limited interpretive relevance. The second approach, the model-agnostic one, addresses this limitation by extracting post-hoc explanations while treating the original model as a black box.
Concerning model interpretability, among the most popular interpretable models, generalized linear and tree-based models are of great value for interpreting model predictions [27,28]. In this work, two linear models were considered: the Least Absolute Shrinkage and Selection Operator (LASSO) [29] and the Linear Support Vector Machine (SVM) [30]. The goal was to extract the most relevant features by analyzing the coefficient weights of each feature, which give information about their significance for predicting the output class. As a nonlinear model, a Decision Tree (DT) was considered, since it provides the importance of each feature [31]. Additionally, the nonlinear transformations in the learning process of various ML models make them powerful in terms of predictive performance but deprive them of interpretability. For nonlinear ML classifiers such as K-Nearest Neighbors (KNN) [32], we focus on a post-hoc interpretability method called Shapley Additive Explanations (SHAP) [33], which is founded on game theory and local explanations. Since SHAP provides the contribution of each feature to the model's output, it can be considered a tool for model interpretability.
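To make the game-theoretic foundation concrete, the exact Shapley value of each feature can be computed by brute force over all coalitions for a toy value function (an illustrative sketch; practical SHAP implementations approximate this, since exact enumeration is exponential in the number of features):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions; feasible only
    for a handful of features. value_fn maps a frozenset of feature names
    to the model output when exactly those features are 'present'."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # classic Shapley weight |S|!(n-|S|-1)!/n! times the
                # marginal contribution of adding feature f to coalition S
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi
```

For an additive value function, each feature's Shapley value equals its own contribution, and the values always sum to `value_fn(all) - value_fn(empty)` (the efficiency property).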
We summarize our main contributions next: (i) a comparison of different resampling and neural-network-based generative models, highlighting oversampling techniques, for generating categorical data, and their influence on binary classification performance; and (ii) a methodology to interpret the most representative risk factors/features for identifying CVD subjects using a dataset composed of sociodemographic, lifestyle, and clinical categorical variables.
The remainder of this article is organized as follows. Section 2 describes the dataset and the pre-processing method, and presents the foundations of categorical encoding techniques and the resampling techniques for addressing the imbalanced learning problem. Section 3 shows the experimental setup, classification performance, and model interpretability outcomes of the linear and nonlinear models considered. Finally, the discussion and conclusions are presented in Section 4 and Section 5, respectively.
3. Results
In this section, we analyze the impact of combining different oversampling methods and ML classifiers in a binary classification scenario for healthy individuals and CVD subjects. We first describe the experimental setup and then present the figures of merit obtained, including a comparison of classification performance using all features and using only those selected by FS. Finally, post-hoc model interpretability based on feature importance is conducted.
3.1. Experimental setup
The dataset was randomly split into training (80%) and test (20%) subsets (see Figure 2), and five independent training/test partitions were considered to further evaluate model performance. The training subset was used for model design, while the test subset was used to evaluate model performance (i.e., its generalization capacity). Bootstrap resampling was used to remove features that were non-relevant and uninformative for predicting the target variable; using this method, a subset of the initial study features was selected. Before training the classifiers, different values of the smoothing parameter w were investigated to ensure proper use of the selected target encoding technique, addressing the over-fitting issue raised by target-agnostic approaches. The AUC values were used to select the best w for our dataset. Experimental results showed that lower values of w offered better AUC values and were consequently more suitable for the binary classification scenario; the chosen regularization smoothing parameter was used in all subsequent analyses.
3.2. Quality evaluation of synthetic clinical data
Several data quality metrics were considered to evaluate the similarity between synthetic and real data under different IRs. As stated in Section 2.2.4, the PMFs associated with the different features are estimated as a preliminary step for computing these metrics. Figure 3 depicts the PMFs associated with age, BMI, sex, and high cholesterol. The panels in the first column show the PMFs of the real data, and the remaining columns show the PMFs obtained from synthetic data generated with SMOTEN, TVAE, and CTGAN. We aim to measure the similarity between the PMF of the real data and those obtained from synthetic data, comparing how well the PMF is learned by the different oversampling methods considered.
Remarkably, the PMFs of the data generated with TVAE do not follow the probability distribution of the real data, lacking probability mass in certain categories. For instance, for the age feature (see Figure 3(a)), the PMF of the synthetic data obtained with TVAE (third column) has categories without probability values, specifically 18-20, 30-39, and NA. Likewise, for the BMI feature (see Figure 3(b)), the categories HW, NA, OBC-II, and OBC-III have no values. Additionally, in some categories the probability is much higher than in the real PMF (see OW in the BMI feature). TVAE is the method that worst mimics the distributions of the four features considered, which further points out its low performance in replicating categorical data. Analyzing SMOTEN, we find insights similar to those for TVAE: there are categories in the PMF without values (see panels (a) and (b), associated with age and BMI). Regarding CTGAN, the PMFs of the generated data are quite similar to the real ones: all categories have probability values, and these are close to those observed in the real PMFs. The insights drawn from these figures allow us to understand how similar the synthetic and real data are in terms of univariate attribute fidelity, with CTGAN emulating the real distributions most precisely.
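Estimating such a PMF from categorical samples amounts to normalized category counts over the feature's full category list; a minimal sketch (function name illustrative):

```python
from collections import Counter

def estimate_pmf(samples, categories):
    """Estimate the PMF of a categorical feature as normalized counts over
    a fixed category list. Categories absent from the samples get
    probability zero, which is exactly the gap observed for TVAE/SMOTEN."""
    counts = Counter(samples)
    n = len(samples)
    return {c: counts.get(c, 0) / n for c in categories}
```

For example, `estimate_pmf(["16-29", "16-29", "30-39"], ["16-29", "30-39", "NA"])` assigns probability 0 to the unseen `NA` category while the others sum to 1.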
The next step is to analyze the data quality metrics under different IR values (see Figure 4). In terms of KLD, HD, MAEP, and RSVR (panels (a), (b), (c), and (d), respectively), CTGAN showed the best results; note that lower values in these metrics indicate more similar probability distributions of the features. Regarding PCD (Figure 4(e)), which measures whether the correlations between features are preserved, CTGAN also reached the highest values compared to SMOTEN and TVAE. For LCM, where lower values indicate fewer differences between the real and synthetic data distributions, TVAE reached the optimal values, with CTGAN obtaining the second-lowest values.
Table 2 presents a summary of the quality metrics that analyze: (i) the similarity between the features of the real and synthetic data (KLD, HD, and MAEP), and (ii) how well the relationships between features are captured (PCD and LCM). These metrics were obtained over five different training partitions, and we present the mean and the standard deviation (std). As previously mentioned, CTGAN more accurately simulated the true distributions, achieving the best KLD, HD, and MAEP values. Regarding PCD, which measures how well the correlations between features are captured, CTGAN also reached the best performance (the highest value). Only in LCM did TVAE reach better performance (the lowest values). We therefore conclude that CTGAN has strong potential for generating categorical synthetic data.
3.3. Classification performance
This section presents the classification results provided by linear (LASSO and SVM) and nonlinear (DT and KNN) classifiers using SMOTEN, TVAE, and CTGAN and considering different resampling strategies.
Figure 5 shows the mean and std of the sensitivity and AUC values over 5 test subset partitions, considering the different classifiers (LASSO, SVM, DT, and KNN) and oversampling strategies. There is a direct relationship between the IR and model performance in terms of sensitivity and AUC, indicating that the higher the number of synthetic samples generated, the better the performance of the ML models. Figure 5(a), (c), and (e) show that the sensitivity values present high variability for the DT model with all oversampling methods. In contrast, the linear models LASSO and SVM present lower variability (low std), being the most robust classifiers. For small IR values (0.4), the AUC is around 0.6 for all models (see Figure 5(b), (d), and (f)), with better performance as the IR increases. The highest AUC values are obtained when CTGAN is combined with the linear models (see Figure 5(f)).
In general, the best figures of merit are obtained when applying linear models with the SMOTEN and CTGAN techniques. The results for TVAE are slightly different: higher values are obtained with the nonlinear KNN model than with the linear models, particularly as the IR increases. We can conclude that the best performance is obtained with CTGAN and IR=1.0, i.e., when the numbers of minority- and majority-class samples are equal.
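The oversampling strategy above can be sketched as growing the minority class until a target IR (minority-to-majority size ratio) is reached; in this sketch the generator callback stands in for SMOTEN/TVAE/CTGAN, with random duplication as an illustrative placeholder (names are hypothetical):

```python
import random

def oversample_to_ir(minority, majority, ir, sample_fn=None, rng=random):
    """Grow the minority class until len(minority)/len(majority) reaches
    the target imbalance ratio (IR). sample_fn produces one synthetic
    sample (e.g. a call into a SMOTEN/TVAE/CTGAN sampler); by default we
    fall back to duplicating random minority samples."""
    if sample_fn is None:
        sample_fn = lambda: rng.choice(minority)
    target = int(ir * len(majority))
    synthetic = [sample_fn() for _ in range(max(0, target - len(minority)))]
    return minority + synthetic
```

With ir=1.0 the returned minority class is exactly as large as the majority class, the fully balanced setting that gave the best performance above.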
Next, Figure 6 shows the sensitivity and AUC values when using the different classifiers with the hybrid resampling approach. Two main insights can be drawn. Firstly, across all six panels, increasing the IR (from 0.5 to 1.0), i.e., generating more samples of the minority class, does not affect the classification measures (neither sensitivity nor AUC). For this approach, generating a larger number of synthetic minority-class samples and training the ML models with more synthetic data does not yield better predictive performance. This indicates that beyond a given IR, the technique has learned the data distribution, and no matter how many additional samples are added, the distribution remains unchanged. Note also that the predictive performance of the linear models (LASSO and SVM) using SMOTEN and CTGAN, in terms of both sensitivity and AUC, is greater than that obtained with the nonlinear models (DT and KNN).
Table 3 summarizes the best sensitivity and AUC values using the different resampling methods and class balancing strategies. These results are presented considering all variables and considering only the variables selected with the bootstrap resampling method. Comparing the figures of merit obtained using all features with those obtained using only the selected features, the models using the previously selected features show a slight improvement in classification performance in most cases. This improvement is mainly seen for TVAE with the hybrid technique (AUC of 0.66 with all variables versus 0.70 with FS). However, the best performance is obtained when CTGAN and the hybrid strategy are combined, both with and without FS. Finally, CTGAN provides better performance than the undersampling approach. This is crucial because it highlights the benefit of this way of creating synthetic samples: in addition to producing high-quality synthetic data, it also improves classifier performance compared to the undersampling strategy, which uses only real data.
3.4. Analyzing risk factors using interpretability methods
Figure 7 shows the coefficient values assigned to each feature when training the linear models with the different oversampling methods (SMOTEN, TVAE, and CTGAN), using the oversampling class balancing strategy and an IR ranging from 0.5 to 1.0 (here, the IR refers to increasing only the number of minority-class samples until the number of majority-class samples is reached).
In Figure 7, we can observe that as the IR goes from 0.5 to 1.0, the coefficient values associated with each feature change little. We therefore conclude that increasing the fraction of synthetic samples has little impact on which features are most relevant for predicting CVD. It can also be seen that, for each specific oversampling approach, the LASSO and DT models yield similar coefficient values, with many of them being zero in both models, nullifying the impact of those variables. However, the coefficients fluctuate between positive and negative values for the SVM model (middle panels), with positive values indicating a greater influence of the variable on predicting subjects with CVD and negative values a greater influence on predicting healthy individuals.
Note also that the weight (coefficient) assigned to each variable varies depending on the oversampling strategy used. This means that the method considered for generating synthetic samples could influence which features are deemed most important for the final classification. When applying SMOTEN and CTGAN, and especially when comparing LASSO and DT, the pattern of coefficient values is comparable, with age, BMI, high cholesterol, and sex being the most significant factors in predicting CVD. TVAE, on the other hand, exhibits a somewhat different pattern: age remains the most important predictor of CVD, followed by processed meat, intense activity, and BMI.
We focus now on SHAP, a post-hoc interpretability approach for non-interpretable ML models. Figure 8 shows the SHAP summary plot when training the nonlinear KNN model with the CTGAN oversampling method and the oversampling class balancing strategy. The summary plot provides several pieces of information concerning the interpretability of the model: it presents the importance of the features as well as their impact on the prediction. Each point in the summary plot is a Shapley value for a specific feature and a specific sample. The position on the x-axis shows the Shapley value assigned to each sample, whereas the y-axis lists the features in descending order of relevance. For our model, the three most relevant variables in the prediction are the presence of high cholesterol, sex, and age, all coinciding with the interpretability results obtained for the linear models. The colors represent the value of the feature from lowest (blue) to highest (red). Thus, observing the most relevant variables, we can conclude that higher values of cholesterol, sex, and age are related to the prediction of subjects with CVD, while lower values of these variables are associated with the prediction of healthy individuals. In contrast, for the alcohol variable the opposite occurs: higher values are relevant for predicting healthy individuals, and lower values for predicting subjects with CVD.
4. Discussion
In this work, different resampling methods were used, highlighting the use of CTGANs as an oversampling technique to generate synthetic data for achieving data balance among different classes in a classification problem. In particular, a dataset with real-world data from healthy individuals and individuals with CVD was employed to carry out the synthetic data generation. The results from different metrics for quality assessment of the generated synthetic datasets were analyzed and discussed.
These results demonstrated the high potential of CTGANs for generating categorical data, keeping relevant information, and improving classification performance. The following five findings are particularly important to this study. Firstly, the GAN-based model generates high-quality synthetic data, yielding PMFs very close to those of the real data. This makes a significant contribution to the literature, demonstrating the potential of GANs in the clinical setting [18]. Secondly, using an oversampling strategy, instead of a hybrid technique, improves sensitivity and AUC scores in all ML models as the IR increases. This is an important contribution since oversampling outperforms undersampling approaches in terms of ML classifier performance. Thirdly, linear classifiers were shown to outperform nonlinear ones when using target encoding. Fourthly, the combination of the GAN- and LASSO-based models yields an AUC value of 71% (an improvement of 2% with respect to the other oversampling strategies, and of 1% with respect to undersampling). Finally, because of the interpretive capabilities of our models, the findings of this study may help improve both the prediction and prevention of CVD and the knowledge of CVD-related risk factors. The most important risk factors for CVD prediction identified in all models were the presence of high cholesterol, age, BMI, and sex.
These findings are consistent with the primary CVD risk factors identified in the literature [53,54,55]. According to [53], BMI is one of the most critical criteria to consider, since excessive adiposity is a significant risk factor for morbidity and death from type 2 diabetes, CVD, and different types of cancer. On the one hand, individuals from high-income countries are more likely to consume healthy foods, according to [56]. On the other hand, low-income people tend to consume more fat and less fiber, which explains the average importance of characteristics like fish or meat consumption in the prediction of CVD. Furthermore, individuals in low- and middle-income countries are more likely to drink alcohol and smoke than socioeconomic groups in high-income countries, which explains the importance of these features for predicting CVD, according to [57].
Despite the potential benefits of using clinical data for research, these data are highly sensitive, and their use is restricted by privacy legislation and organizational guidelines [58]. Additionally, patient data are regulated by laws protecting patients' privacy, such as the Health Insurance Portability and Accountability Act in the United States and the General Data Protection Regulation in the European Union [59]. Sharing of public health data has always been hampered by privacy concerns. Furthermore, in the clinical setting, most of the populations studied are commonly unbalanced, with the class of patients with a certain disease typically being smaller than the class of healthy individuals. In this way, synthetic data could allow researchers to delve deeper into complex medical issues, eliminating challenges such as a lack of access to protected data and addressing the issue of class imbalance [42].
Further work will assess the findings of this study on several different real-world clinical and nonclinical datasets. Furthermore, the quality of the synthetic data could be assessed by designing specific classifiers able to discriminate between real and synthetic data, thereby evaluating the performance of the oversampling techniques. Finally, other FS techniques could be studied to confirm the findings of this work concerning the most relevant features for improving the performance and interpretability of the ML models.
Author Contributions
Conceptualization, I.M.-J. and C.S.-R.; methodology, I.M.-J. and C.S.-R.; software, C.G.-V. and D.C.-M.; validation, I.M.-J., H.F. and C.S.-R.; formal analysis, I.M.-J. and C.S.-R.; investigation, C.G.-V. and D.C.-M.; resources, I.T.G., C.G. and C.S.-R.; data curation, I.T.G. and M.-L.L.; writing—original draft preparation, C.G.-V., D.C.-M., H.F. and C.S.-R.; writing—review and editing, I.M.-J., I.T.G., M.-L.L., C.G. and C.S.-R.; visualization, C.G.-V. and D.C.-M.; supervision, I.M.-J. and C.S.-R.; project administration, C.G. and C.S.-R.; funding acquisition, C.S.-R. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Profiles for healthy and CVD individuals considering: (a) background; (b) substance use; (c) PA; (d) dietary intake; and (e) income features.
Figure 2.
Proposed workflow using oversampling techniques (SMOTEN, TVAE, CTGAN) and different ML classifiers.
Figure 3.
PMFs associated with: (a) age; (b) BMI; (c) sex; and (d) high cholesterol. PMF obtained with real data (first column); SMOTEN (second column); TVAE (third column); and CTGAN (fourth column).
Figure 4.
Mean±std for the synthetic data quality metrics when considering different IR and oversampling techniques. (a) KLD; (b) HD; (c) MAEP; (d) RSVR; (e) PCD; (f) LCM.
Figure 5.
Mean±std of the sensitivity (left panels) and AUC (right panels) considering 5 test subset partitions and different IRs, classifiers (LASSO, SVM, DT and KNN) and an oversampling approach with: (a,b) SMOTEN; (c,d) TVAE; and (e,f) CTGAN.
Figure 6.
Mean±std of the sensitivity (left panels) and AUC (right panels) considering 5 test subset partitions and different IRs, classifiers, and a hybrid approach (by combining undersampling and oversampling strategies) with: (a,b) SMOTEN; (c,d) TVAE; and (e,f) CTGAN.
Figure 7.
Coefficient values for different IR when using LASSO (left panels), SVM (middle panels), and DT (right panels) and the oversampling methods: SMOTEN (first row); TVAE (second row); and CTGAN (third row).
Figure 8.
SHAP summary plot of KNN model using the CTGAN oversampling technique and all subjects.
Table 1.
Summary of the features and categories in the dataset.
| Group | Feature | Description | Categories |
|---|---|---|---|
| Background | Age | Individual's age | 16-29, 30-39, 40-49, 50-59, 60-69, NA |
| | Sex | Individual's sex | Woman (W), Man (M), NA |
| | BMI | Body mass index | HW, OBC-I, OBC-II, OBC-III, OW, UW, NA |
| | Education | Education level achieved | U<4Y, U≥4Y, PS10Y, HS≥3Y, NA |
| | HC | Have cholesterol | Yes, No |
| Substance use | Smoking | Cigarette use | CD, FO, FD, CO, N, NA |
| | Snuff | Snuff use | CD, FO, FD, CO, N, NA |
| | E-cigarette | E-cigarette use | CD, FO, FD, CO, N, NA |
| | Alcohol | Alcohol consumption | Yes, No |
| | Alcohol freq. | Alcohol drink frequency | 2-3 p/w, 4 p/w, ≤1 p/m, 2-4 p/m, NA |
| | Alcohol units | # units consumed | 1-2, 3-4, 5-6, 7-9, ≥10, NA |
| | ≥6 units | ≥6 units of alcohol consumed | D, W, <M, M, N, NA |
| PA | Strenuous PA | # days of strenuous PA | 0, 1-2, 3-4, 5-6, 7, NA |
| | Moderate PA | # days of moderate PA | 0, 1-2, 3-4, 5-6, 7, NA |
| | Walking | # days of walking ≥10 minutes | 0, 1-2, 3-4, 5-6, 7, NA |
| | Daily sitting | # hours sitting on a weekday | 0-2, 3-5, 6-8, 9-11, 12-14, ≥15, NA |
| Dietary intake | Extra salt | Freq. extra salt added to food | N, OCC, O, A, NA |
| | Sugary drinks | # sweetened drinks | 0, 1, 2, 3, 4, 5, 6, ≥7, NA |
| | Fruits/Berries | Fruit and berry servings | 0, 1, 2, 3, 4, ≥5, NA |
| | Vegetables | Lettuce and vegetable servings | 0, 1, 2, 3, 4, ≥5, NA |
| | Red meat | # red meat servings consumed | 0, 1, 2, 3, 4, ≥5, NA |
| | Other meat | # processed meat servings consumed | 0, 1, 2, 3, 4, ≥5, NA |
| | Fish | # fish products consumed | 0, 1, 2, 3, 4, ≥5, NA |
| Income | House income | Gross household income | ≤150K, 150-350K, 351-550K, 551-750K, 751-1000K, ≥1000K, NA |
| | Household adult | # household members ≥18 years | 0, 1, 2, 3, ≥4, NA |
| | Household young | # household members ≤18 years | 0, 1, 2, 3, ≥4, NA |
Table 2.
Mean±std (evaluated on 5 partitions) of data quality metrics for different oversampling techniques. The best values are shown in bold.
| Method | KLD | HD | MAEP | RSVR | PCD | LCM |
|---|---|---|---|---|---|---|
| SMOTEN | 0.092±0.004 | 0.168±0.006 | 0.298±0.014 | 0.609±0.011 | 2.011±0.135 | -8.116±0.407 |
| TVAE | 0.299±0.022 | 0.313±0.013 | 0.590±0.032 | 0.177±0.020 | 3.487±0.320 | -3.926±0.517 |
| CTGAN | 0.017±0.002 | 0.061±0.003 | 0.145±0.006 | 0.001±0.001 | 3.724±0.423 | -6.110±0.235 |
Table 3.
Sensitivity, AUC (mean±standard deviation) on 5 test subsets when training ML models using different resampling strategies with all features and FS. The highest average performance for each figure of merit is marked in bold.
| Method | Balancing Strategy | Sensitivity (All) | Sensitivity (FS) | AUC (All) | AUC (FS) |
|---|---|---|---|---|---|
| RUS | Under | 0.711±0.059 | 0.707±0.041 | 0.697±0.025 | 0.706±0.021 |
| SMOTEN | Over | 0.701±0.046 | 0.708±0.054 | 0.701±0.025 | 0.700±0.012 |
| | Hybrid | 0.686±0.028 | 0.707±0.014 | 0.694±0.012 | 0.709±0.020 |
| TVAE | Over | 0.636±0.067 | 0.640±0.013 | 0.650±0.027 | 0.668±0.024 |
| | Hybrid | 0.615±0.023 | 0.694±0.041 | 0.661±0.026 | 0.704±0.016 |
| CTGAN | Over | 0.699±0.044 | 0.695±0.021 | 0.702±0.026 | 0.707±0.017 |
| | Hybrid | 0.716±0.043 | 0.709±0.036 | 0.712±0.017 | 0.711±0.021 |