Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical technique used for dimensionality reduction and data visualization. It is a linear method that finds a new set of orthogonal axes, called principal components, that capture the most variability in the data. The first principal component (PC) captures the most variation, the second PC captures the second most variation, and so on [18,19].
Mathematically, PCA can be described as follows:
Given a data matrix X with n observations and p features, we want to find a new set of p' features (p' ≤ p) that capture the most variation in the data.
The new features are linear combinations of the original features and are represented by a matrix Y: Y = X * W, where W is a p x p' matrix called the loading matrix.
The loading matrix is found by solving the following optimization problem:
maximize Σ_{i=1}^{n} || y_i − μ_y ||², subject to w_j^T w_j = 1 and w_j^T w_l = 0 for j ≠ l,
where y_i is the i-th row of Y, μ_y is the mean of the rows of Y, and w_j is the j-th column of W.
The loading matrix W can be found using singular value decomposition (SVD) or eigenvalue decomposition (EVD).
The new features can be ranked by their contribution to the variation in the data. The first PC is the new feature that captures the most variation, the second PC captures the second most variation, and so on.
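The procedure described above can be sketched in a few lines of numpy. This is a minimal illustration of PCA via SVD, not the study's implementation; the function name, the random demo data, and the argument `k` (the number of retained components, p' in the text) are all illustrative.

```python
import numpy as np

# Minimal PCA-via-SVD sketch: center X, decompose, keep the top k components.
def pca_svd(X, k):
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                             # loading matrix (p x p')
    Y = Xc @ W                               # scores: Y = X * W on centered data
    explained = S**2 / np.sum(S**2)          # proportion of variance per PC
    return Y, W, explained[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # toy data: 200 observations, 5 features
Y, W, explained = pca_svd(X, 2)
```

Because the rows of Vt returned by the SVD are orthonormal, the loading matrix W is orthogonal by construction, and the scores in Y come out ranked by the variance they capture.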
The main purpose of using PCA in this study is:
First, to determine the degree of correlation between variables. PCA identifies patterns in the data and creates new, derived variables (the principal components) that capture as much of the variation in the data as possible. These derived variables are orthogonal (uncorrelated) and ranked by the amount of variation they capture. By examining the principal components, one can determine which variables are most highly correlated and how they relate to one another.
Figure 1 and Figure 2 show a correlation matrix and a correlation circle, two graphical representations of the correlations between the variables in the dataset. The analysis revealed a strong correlation among the majority of the variables, indicating a high degree of interdependence among the factors studied; the correlations between the variables range from strong to weak, with some variables showing only a moderate correlation with others.
The correlation matrix confirms that most of the variables are strongly correlated, while exercise, family history, height, and HDL have only a very weak correlation with the other variables.
The strong correlation found among the variables in this study highlights the importance of considering multiple factors when studying CVD risk factors in obese patients. The weak correlations found for exercise, family history, height, and HDL suggest that these factors may play a less significant role in the relationships with the other variables for this dataset. However, it is important to note that correlations between variables may vary depending on the dataset and population being studied, and these results should be interpreted with caution when generalized.
Table 2 shows the squared cosines, also known as factor loadings, which indicate the correlation between each variable and each PC (F). They are presented as a matrix with the variables in the rows and the PCs (F1, F2, …) in the columns. A high squared cosine value for a variable and a PC means that the variable is strongly associated with that PC; this can be used to understand which variables drive the variation in the data for each PC. For each variable, the value in bold corresponds to the factor for which the squared cosine is largest.
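As an illustrative sketch (not the study's code), the squared cosines of Table 2 can be computed as the squared correlation between each original variable and each PC score. The function name, the random demo data, and the use of a standardized-data SVD to obtain the scores are assumptions for the example.

```python
import numpy as np

# Squared cosines: squared correlation of each variable with each PC score.
# Rows index the original variables, columns index the components F1, F2, ...
def squared_cosines(X, Y):
    p, k = X.shape[1], Y.shape[1]
    cos2 = np.empty((p, k))
    for a in range(p):
        for b in range(k):
            cos2[a, b] = np.corrcoef(X[:, a], Y[:, b])[0, 1] ** 2
    return cos2

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))                    # toy data
Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize features
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
Y = Z @ Vt.T                                     # scores on all 4 components
cos2 = squared_cosines(X, Y)
dominant = cos2.argmax(axis=1)                   # the 'bold' factor per variable
```

When all components are retained, each row of the matrix sums to 1, since the squared correlations partition each variable's variance across the orthogonal components.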
Second, to use Principal Component Analysis (PCA) as a tool for determining the optimal number of clusters in the dataset by analyzing the proportion of variance explained by each principal component, the relationships between the variables, and the representation of the data on the biplot.
Figure 3 shows the Scree plot, a graphical representation of the eigenvalues of the principal components (PCs). The plot displays the eigenvalues on the y-axis and the number of PCs on the x-axis. The point at which the eigenvalues level off is called the "elbow" of the plot, and the number of factors (PCs) before the elbow is taken as the optimal number of clusters in the dataset. The Scree plot thus helps to identify the PCs that explain the most variance in the data and, in turn, to determine the number of clusters [20,21].
Figure 4 shows a biplot, which is a graphical representation of the data on a two-dimensional plane, where the first two PCs (F1 and F2) are used as the x- and y-axes. Each variable is represented by a vector, and each observation by a point. The angle between the vectors and the position of the points on the biplot can be used to interpret the relationships between the variables and the observations. By analyzing the biplot, one can identify natural groups or clusters in the data.
The Scree plot helps to identify the number of PCs that explain the most variance in the data, and the biplot helps to identify natural groups or clusters in the data. Together, these techniques provide a comprehensive understanding of the data and help to determine the optimal number of clusters in the dataset.
It is important to note that PCA is a linear method, so it may not capture all the non-linear relationships in the data; it should therefore be combined with other techniques, such as clustering, to gain further insight into the data.
In the following section, we will apply the Fuzzy C-Means clustering technique for further exploration and deeper insight into the data, in order to capture the non-linear relationships present within the dataset.
Fuzzy C-Means Clustering (FCM)
FCM stands for fuzzy c-means clustering. It is an unsupervised machine-learning algorithm used for clustering data points into a specified number of clusters [22].
The Fuzzy C-Means algorithm, as the name suggests, implements fuzzy logic into the standard k-means algorithm, allowing for a more nuanced and flexible clustering approach. Unlike hard clustering methods, where each data point belongs exclusively to one cluster, FCM allows for the possibility that a data point can belong to multiple clusters with varying degrees of membership. This fuzzy membership can capture the subtle complexities and inherent uncertainties that may exist within the dataset, thus providing a more realistic representation of the data structure [23,24].
FCM is commonly used in various fields, including pattern recognition, image processing, and data mining.
The mathematical formulation of FCM is as follows:
Given a dataset X with n data points and p features, and a desired number of clusters k, the objective of FCM is to partition the data points into k clusters such that the fuzzy sum of squared errors (SSE) is minimized.
The SSE is calculated as:
SSE = Σ_{i=1}^{n} Σ_{j=1}^{k} (c_ij)^m || x_i − v_j ||²,
where c_ij is the membership value of data point i in cluster j, m > 1 is the fuzziness coefficient, x_i is the i-th data point, and v_j is the centroid (mean) of cluster j. The memberships are updated as:
c_ij = 1 / Σ_{l=1}^{k} ( || x_i − v_j || / || x_i − v_l || )^{2/(m−1)},
where || x_i − v_l || is the distance between data point x_i and the centroid v_l of cluster l.
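The alternating centroid and membership updates above can be sketched as follows. This is an illustrative implementation, not the one used in the study; the function name, random initialization, iteration count, and toy data are all assumptions.

```python
import numpy as np

# FCM sketch: alternate centroid updates and membership updates.
def fcm(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = rng.random((n, k))
    C /= C.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(n_iter):
        Cm = C ** m
        V = (Cm.T @ X) / Cm.sum(axis=0)[:, None]            # centroid update
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                               # avoid division by zero
        # c_ij = 1 / sum_l (||x_i - v_j|| / ||x_i - v_l||)^(2/(m-1))
        C = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return C, V

# Toy usage: two well-separated blobs should yield two crisp clusters.
X = np.vstack([np.full((20, 2), 0.0), np.full((20, 2), 5.0)]) \
    + np.random.default_rng(3).normal(0, 0.2, (40, 2))
C, V = fcm(X, 2)
labels = C.argmax(axis=1)
```

The membership update normalizes each row of C to sum to 1 by construction, so every point distributes a total membership of 1 across the k clusters.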
The results obtained from the Principal Component Analysis (PCA) provide valuable insights into the correlation between the variables, the proportion of variance explained by each principal component, and the number of possible clusters in the dataset as indicated by the Scree plot and biplot. However, it is important to note that PCA is a linear method and may not capture all non-linear relationships present in the data. In order to gain a more comprehensive understanding of the underlying structures in the dataset, we will be applying the Fuzzy C-Means (FCM) algorithm. This non-linear clustering method has been proven to be an effective tool for uncovering hidden patterns in data, as demonstrated in various recent studies and references such as [25,26,27]. By using FCM, we aim to gain a deeper understanding of the complex relationships within the dataset.
We will be using the results from the PCA to guide our application of the Fuzzy C-Means (FCM) algorithm. This approach will allow us to make the most effective use of the insights gained from the PCA.
The Scree plot and biplot from the PCA can be instrumental in determining the number of clusters in the FCM algorithm. The Scree plot shows the eigenvalues of the principal components in descending order, and the point where the decline in eigenvalues becomes less steep (often referred to as the 'elbow') can suggest the appropriate number of clusters. Similarly, the biplot can provide a visual representation of the data points and the principal components, which can aid in identifying clusters and their compositions.
The PCA also allows us to understand which variables have the highest impact on the dataset. The principal components are linear combinations of the original variables, weighted by their contribution to explaining the variance in the data. Therefore, variables that contribute most to the principal components can be considered the most impactful. These variables will be used as key inputs in the FCM algorithm. By applying FCM, we will be able to further explore the structure of the data, focusing on the clusters that emerge, and the relationships between the most impactful variables within these clusters. Given the fuzzy nature of the FCM algorithm, this will allow us to understand the degree of membership of each data point in the various clusters, providing a nuanced view of the data. To apply the Fuzzy C-Means (FCM) algorithm and evaluate the best clustering model among several options, we can follow the steps below:
A. Determine the Number of Clusters
We will use the Scree plot and biplot from the PCA to suggest the number of clusters. The 'elbow' in the Scree plot often indicates the optimal number of clusters.
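In practice the elbow is usually read off the Scree plot visually, but one simple numeric heuristic can sketch the idea: keep the components before the point where the decline in eigenvalues flattens the most. The function name, the heuristic itself, and the example eigenvalues are illustrative assumptions, not the study's procedure.

```python
import numpy as np

# Heuristic elbow finder (illustrative sketch): locate the largest slowdown
# in the decline of the eigenvalues and keep the components before it.
def elbow_components(eigenvalues):
    ev = np.asarray(eigenvalues, dtype=float)
    drops = -np.diff(ev)                  # decrease between consecutive PCs
    slowdown = drops[:-1] - drops[1:]     # how sharply the decline flattens
    return int(np.argmax(slowdown)) + 1   # number of components before the elbow

k = elbow_components([4.0, 2.0, 0.5, 0.4, 0.3])  # eigenvalues level off after PC2
```

For eigenvalues that fall steeply and then level off, as in the hypothetical sequence above, the heuristic returns the count of components preceding the flattening.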
Figure 3 effectively illustrates the descending eigenvalues, forming distinct 'elbows' at the points represented by factors F3, F4, and F5. These inflection points serve as strong indicators for the optimal number of clusters within the data.
To further investigate and pinpoint the best representation of homogeneous natural groups within the dataset, we will conduct a comparative analysis of three distinct clustering models: Model 1 with three clusters, Model 2 with four clusters, and Model 3 with five clusters.
In this analysis, we will also take into account the graphical insights provided by Figure 4. This biplot visually highlights the inherent natural groups within our data. The alignment of these natural groups with our clustering models will play a crucial role in assessing the effectiveness of each model and ultimately determining the most fitting representation of our dataset.
B. Model Selection and Evaluation
In the context of clustering models, several metrics are commonly deployed to evaluate the quality of the clusters formed. These include Between Cluster Variation, Partition Coefficient, Partition Entropy, and the Silhouette Score. Each of these metrics provides different insights and they are often used in combination to assess the overall quality of the clustering model.
Between Cluster Variation (BCV): This metric measures the variance between clusters. A model with a higher Between Cluster Variation is typically considered better, as it signifies distinct clusters [28].
Partition Coefficient (PC): The PC measures the 'fuzziness' or overlap of the clusters in a fuzzy clustering model. A higher Partition Coefficient indicates less fuzziness, meaning the data points are more clearly assigned to one cluster than to others [29].
Partition Entropy (PE): The PE is another measure of fuzziness in a fuzzy clustering model. Unlike the Partition Coefficient, a lower Partition Entropy indicates less fuzziness; therefore, the model with the lowest Partition Entropy is considered the best [29].
Silhouette Score (SS): The silhouette score measures how close each point in one cluster is to the points in the neighboring clusters. It ranges from -1 to 1, with 1 indicating that the clusters are well apart from each other and -1 indicating that the clusters are too close to each other. The higher the silhouette score, the better the clustering solution [30].
The combined use of these metrics can provide a comprehensive evaluation of the quality of a clustering model. They each offer a unique perspective and together they can help to identify the most effective model for a given dataset.
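Three of these metrics can be sketched directly from their definitions. The code below is an illustrative implementation (function names and the toy inputs are assumptions): the Partition Coefficient and Partition Entropy are computed from the fuzzy membership matrix C, and the Silhouette Score from the data and the hard cluster labels; BCV is omitted here since its exact formula varies between sources.

```python
import numpy as np

# PC = (1/n) * sum_i sum_j c_ij^2 ; closer to 1 means crisper clusters.
def partition_coefficient(C):
    return float(np.mean(np.sum(C ** 2, axis=1)))

# PE = -(1/n) * sum_i sum_j c_ij * ln(c_ij) ; lower means less fuzziness.
def partition_entropy(C):
    Cc = np.clip(C, 1e-12, 1.0)           # guard log(0)
    return float(-np.mean(np.sum(Cc * np.log(Cc), axis=1)))

# Mean over points of (b - a) / max(a, b): a is the mean distance to the
# point's own cluster, b the smallest mean distance to another cluster.
def silhouette_score(X, labels):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                   # exclude the point itself
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

crisp = np.eye(3)[np.array([0, 0, 1, 1, 2, 2])]   # fully crisp memberships
X = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 1, 1])
ss = silhouette_score(X, labels)
```

On the toy inputs, a fully crisp membership matrix gives a PC of 1 and a PE of essentially 0, and the two tight, well-separated pairs give a silhouette score close to 1, matching the interpretations given above.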
Based on the analysis of the metrics in Table 3, Model 2 appears to be the superior clustering model for the given situation. Although Model 1 has a slightly better Partition Coefficient and lower Partition Entropy, suggesting a lower degree of fuzziness and less randomness, these advantages are outweighed by the significantly higher Between Cluster Variation and Silhouette Score of Model 2. The higher Between Cluster Variation score in Model 2 indicates that its clusters are more distinct from each other, and the higher Silhouette Score suggests that the data points are well clustered and fit better within their assigned clusters than with the data points in the other clusters. Therefore, given the importance of these two metrics in assessing the quality of a clustering model, it can be concluded that Model 2 is the preferable choice for this dataset. Taking into account the nature of the dataset, which contains information about obesity and Cardiovascular Disease (CVD) risk, it is plausible to expect four natural subgroups in the data: high risk, medium risk, low risk, and healthy individuals.
While Model 3, with five clusters, was considered, it was ultimately excluded due to its overall lower performance metrics compared to both Model 1 and Model 2. Despite having a Between Cluster Variation (BCV) of 81.7, which suggests a reasonable level of distinction between clusters, its Silhouette Score (SS) is only 0.3. The Silhouette Score is a critical measure indicating how well each data point has been assigned to its cluster compared to other clusters. A low score, such as 0.3, indicates that the data points might not be appropriately grouped, suggesting that the clusters in Model 3 are not as coherent or meaningful as those in the other models. Given this context, Model 2, with its four clusters, may provide a more intuitive and meaningful interpretation of the data, aligning well with these expected subgroups.
The following synthesis is drawn from the health metrics in Table 4, which provides a detailed breakdown of the principal variables within four discrete groupings. These parameters are pivotal to the evaluation of cardiovascular disease (CVD) risk, encompassing age, blood pressure, body mass index (BMI), waist circumference, glycemic indices, insulin sensitivity, and lipid profiles. The characterization of each cluster captures the aggregate data trends and potential CVD risks suggested by the evidence in Table 4.
Cluster_0: Younger Adults with a Moderate Risk Profile
- The Age Bracket is marked by the youngest cohort (20-37 years) with an approximate mean age of 29 years.
- Blood Pressure measurements indicate mean systolic and diastolic pressures within acceptable parameters; nevertheless, instances of augmented systolic pressure were observed.
- BMI: The average BMI signifies a preponderance towards overweight status, with certain individuals classified as obese.
- Waist Circumference & Waist-Hip Ratio: Both measurements are elevated, denoting central adiposity - a salient risk determinant for CVD.
- Glycemic and Insulin Sensitivity Indices: Fasting blood sugar levels are marginally raised, while Homeostatic Model Assessment for Insulin Resistance levels are heightened, inferring the presence of insulin resistance, a prognosticator for both diabetes and CVD.
- Lipidomic Profile: A moderate increase in cholesterol levels is discernible; LDL concentrations are skewed towards the upper range – a fact that escalates CVD risk. Notwithstanding, HDL ratios predominantly remain within normal bounds and triglyceride values approach the higher threshold of normalcy.
Cluster_1: Older Adults at Elevated Risk
- The Age Range for this cluster spans older participants (55-72 years), averaging roughly 61 years.
- Blood Pressure: The average systolic and diastolic pressures are higher; numerous subjects report a history of hypertension or are undergoing pharmacological treatment.
- BMI: The figures are similar to those of Cluster_0 with respect to the rates of overweight and obesity.
- Waist Circumference & Waist-Hip Ratio: Average figures convey the presence of central obesity – a significant risk contributor to CVD.
- Blood Sugar and Insulin Resistance: Comparative analysis reveals that fasting blood sugar and HOMA-IR levels surpass those of Cluster_0, underscoring an increased incidence of impaired glucose tolerance, overt diabetes, and insulin resistance.
- Lipid Profile: Elevated total cholesterol, LDL cholesterol, and triglycerides, combined with diminished HDL levels, constitute a composite high-risk profile for CVD.
Cluster_2: Middle-Aged Adults with a Diverse Risk Profile
This demographic encompasses individuals aged 47 to 56, averaging approximately 51 years. Blood pressure measurements largely fall within normal ranges, albeit with some notable deviations. Body Mass Index (BMI) exhibits considerable variation, ranging from normal weight to obesity. In terms of abdominal adiposity, average waist circumference and waist-to-hip ratio suggest a lower prevalence of central fat accumulation compared to Clusters 0 and 1.
Glycemic control appears predominantly adequate among this cohort, as reflected by generally normal fasting blood sugar levels and Homeostatic Model Assessment for Insulin Resistance (HOMA-IR) indices that are lower than those observed in the preceding groups; this implies a diminished likelihood of diabetes mellitus and cardiovascular diseases (CVD).
Lipid profiles within this cluster indicate moderate cholesterol concentrations, with low-density lipoprotein (LDL) values tending towards preferable ranges, signaling a reduced relative risk for CVD in comparison to Cluster_1.
Cluster_3: Early Middle-Aged Adults with Elevated Risk Indices
Individuals in the early middle-age category, ranging from 37 to 48 years with an approximate mean age of 43 years, constitute this cluster. The population's mean systolic and diastolic blood pressures exceed typical values, denoting a potential risk for hypertension. The mean BMI falls within the overweight classification, often tipping into obesity.
Central obesity is significantly represented in this group as evidenced by average waist circumference and waist-to-hip ratios, factors known to augment CVD risk.
The cohort exhibits heightened fasting blood sugar and HOMA-IR levels indicative of an increased susceptibility to diabetes mellitus and cardiovascular conditions.
The lipidemic status is characterized by elevated total cholesterol and LDL concentrations that pose an increased risk for CVD, alongside high-density lipoprotein (HDL) levels that fail to offer an adequate protective effect.
The foregoing analysis delineates the potential cardiovascular risks associated with each cluster, based on the data derived from Table 4. It is imperative to acknowledge the potential variation in individual risks and to consider additional contributory factors such as lifestyle choices, dietary patterns, and genetic predispositions in any comprehensive assessment of cardiovascular disease risk.
Figures 5 through 13 present density curves that provide a nuanced visual interpretation of the distribution patterns of critical variables, including age, systolic and diastolic blood pressure (SBP and DBP), body mass index (BMI), total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TG), and visceral adiposity, across the four identified clusters. These graphical representations elucidate variations in the prevalence and magnitude of cardiovascular disease (CVD) risk determinants among the clusters.