2.1.1. Analysis Using Electronic Health Records
Our EHR results are presented in two parts, listed below:
- (1) Data Preprocessing and Interpretation from EHR
- (2) Albumin-Creatinine Ratio Test as Basis for Assessing DFU Severity in Hispanics
- (2)
Data Preprocessing and Interpretation from EHR
In this study, we processed Electronic Health Record (EHR) data to identify notable demographic distinctions. We began with EHR data comprising 8,969 de-identified patient records. To restrict the analysis to the most relevant records, we categorized patients into Type 1 Diabetes, Type 2 Diabetes, and an 'Other' group, and then excluded the 'Other' entries.
Our hypothesis is that the laboratory test results in the EHR reflect molecular differences across demographic groups. These differences, influenced by genetics, lifestyle, and socio-economic factors, may contribute to the varying likelihood of developing diabetic foot ulcers (DFU). To improve data quality, we excluded entries lacking laboratory data, leaving 7,153 patient records for analysis. Because our study places particular emphasis on the Hispanic demographic, we split the dataset into Hispanic and non-Hispanic groups.
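The filtering steps described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' pipeline: the column names (`diabetes_type`, `hispanic`, `acr`, `ldl`) and the toy records are hypothetical, and "lacking laboratory data" is assumed to mean that every laboratory column is empty for that patient.

```python
import pandas as pd

def filter_cohort(records, lab_columns):
    """Apply the exclusion steps, then split the cohort by Hispanic origin.

    Column names are hypothetical placeholders, not the real EHR schema.
    """
    # Step 1: keep Type 1 and Type 2 diabetes; drop the 'Other' group.
    diabetic = records[records["diabetes_type"].isin(["Type 1", "Type 2"])]
    # Step 2: drop patients for whom every laboratory column is missing.
    with_labs = diabetic.dropna(subset=lab_columns, how="all")
    # Step 3: split the remaining patients by Hispanic origin.
    is_hispanic = with_labs["hispanic"].astype(bool)
    return with_labs[is_hispanic], with_labs[~is_hispanic]

# Toy records standing in for the de-identified EHR extract.
toy = pd.DataFrame({
    "diabetes_type": ["Type 1", "Type 2", "Other", "Type 2", "Type 2"],
    "hispanic": [True, False, True, True, False],
    "acr": [12.0, None, 30.0, None, 45.0],
    "ldl": [90.0, None, 110.0, 130.0, None],
})
hispanic, non_hispanic = filter_cohort(toy, ["acr", "ldl"])
print(len(hispanic), len(non_hispanic))
```

In the toy data, one patient is removed for belonging to the 'Other' group and one for having no laboratory values, which mirrors the two exclusion steps that reduce the 8,969 records to 7,153.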
Figure 2 below illustrates the initial filtering steps applied to the dataset.
Figure 2.
Flowchart of Patient Selection for Study on Diabetic Patients by Hispanic Origin.
The dataset also contains demographic information that may serve as labels for further feature selection: Vital Status, which records whether the patient is alive or deceased; the Biological Sex of the patient; the Current Federal Poverty Line (FPL), which gives insight into the patient's economic status; whether the patient lives in a Rural Community; and whether the patient lives in the northern or southern states of the United States. Analyzing the dataset together with this demographic information may show how these differences affect the likelihood of developing the disease. Later in this work, we use the Vital Status information to develop our risk index.
The EHR also contains the LOINC codes used to record the laboratory tests conducted to diagnose and manage each patient's disease. The dataset contains 63 of these codes, which we use for further analysis. Because we are especially interested in significant differences in the diagnosis of this disease between Hispanic and non-Hispanic patients, we focused on the features that best separate these two classes. To do this, we evaluated the Mean Decrease in Accuracy for the Hispanic versus non-Hispanic classification. Mean Decrease in Accuracy is a measure of feature importance: it quantifies how much model accuracy changes when the values of a specific feature are randomized. Sorting the features by their Mean Decrease in Accuracy reveals the main contributors to model accuracy, and plotting them shows the key factors influencing predictions for both classes. Our objective was to substantiate our hypothesis that laboratory test results, considered across demographic factors, provide insight into the molecular activities that contribute to the occurrence of DFU.
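As an illustration of the idea, the sketch below estimates a mean decrease in accuracy by permuting one feature at a time on held-out data. It is an assumption-laden example rather than the authors' implementation: the source does not name the model or software used, so a scikit-learn random forest with `permutation_importance` is swapped in, the feature names are hypothetical, and the data are synthetic with missing values assumed to be already handled.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def mean_decrease_accuracy(X, y, feature_names, n_repeats=10, seed=0):
    """Rank features by the average accuracy lost when each one is shuffled."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Assumed model: the source does not state which classifier was used.
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    # Shuffle each feature n_repeats times and record the accuracy drop.
    result = permutation_importance(
        model, X_test, y_test, scoring="accuracy",
        n_repeats=n_repeats, random_state=seed,
    )
    order = np.argsort(result.importances_mean)[::-1]
    return [(feature_names[i], float(result.importances_mean[i])) for i in order]

# Synthetic stand-in for the lab matrix: only lab_0 determines the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)
ranking = mean_decrease_accuracy(X, y, ["lab_0", "lab_1", "lab_2", "lab_3"])
for name, drop in ranking:
    print(f"{name}: {drop:.3f}")
```

Because only `lab_0` carries signal in the synthetic data, shuffling it should produce by far the largest accuracy drop, while the other features should stay near zero.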
We started by collating the 10 most important laboratory tests for each class in the Hispanic versus non-Hispanic classification task. After removing duplicate test labels, we were left with 18 unique test labels. Next, we considered the mean value of each of these tests across all patients. We then used the Mann-Whitney U test to check, for each test, whether the Hispanic and non-Hispanic classes differed significantly, as judged by the p-value. We singled out the tests with the greatest statistical significance and used them as the basis for further comparison.
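The two steps above, merging the per-class top-10 lists and then testing each remaining laboratory test, can be sketched as follows. This is a minimal example under stated assumptions: the test names and values are invented, `scipy.stats.mannwhitneyu` is used with a two-sided alternative, and the 0.05 significance threshold is an assumption because the source does not state one.

```python
from scipy.stats import mannwhitneyu

def merge_top_features(top_a, top_b):
    """Union of two ranked lists, keeping first-seen order and no repeats."""
    return list(dict.fromkeys(list(top_a) + list(top_b)))

def compare_groups(group_a, group_b, tests, alpha=0.05):
    """Run a two-sided Mann-Whitney U test per lab test, sorted by p-value."""
    rows = []
    for test in tests:
        _, p_value = mannwhitneyu(
            group_a[test], group_b[test], alternative="two-sided"
        )
        rows.append((test, float(p_value), bool(p_value < alpha)))
    return sorted(rows, key=lambda row: row[1])

# Hypothetical shortened top lists; in the study, 10 + 10 labels gave 18 unique.
merged = merge_top_features(["acr", "ldl", "calcium"], ["ldl", "hdl"])

# Invented per-patient values for two of the tests.
hispanic = {
    "acr": [10, 11, 12, 13, 14, 15, 16, 17],
    "ldl": [100, 102, 104, 106, 108, 110, 112, 114],
}
non_hispanic = {
    "acr": [30, 31, 32, 33, 34, 35, 36, 37],
    "ldl": [101, 103, 105, 107, 109, 111, 113, 115],
}
results = compare_groups(hispanic, non_hispanic, ["acr", "ldl"])
for test, p_value, significant in results:
    print(test, p_value, significant)
```

In the invented data, the `acr` values of the two groups do not overlap at all, so that test is flagged as significant, whereas the interleaved `ldl` values are not.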
Table 1 below shows all the tests with key statistical variables.
- (3)
Albumin-Creatinine Ratio Test as Basis for Assessing DFU Severity in Hispanics
We observed that the Albumin/Creatinine ratio (ACR) test differed significantly between patients of Hispanic and non-Hispanic origin (Mann-Whitney U test, p = 5.85e-14). We are also able to show further key molecular and protein compositions that may differ between Hispanic and non-Hispanic patients, thereby highlighting disparities in health outcomes.
The Albumin/Creatinine ratio test is performed on urine and reports the ratio of albumin to creatinine. Doctors use it to detect early signs of kidney disease. Chronic Kidney Disease (CKD) is also a complication of diabetes, and the development of CKD indicates that DFU in a patient has progressed [9]. We check for statistically significant differences between the Hispanic and non-Hispanic labels using ACR as a risk index. Figure 3 is a box plot that shows the mean, median, and Mann-Whitney U test p-value, indicating a statistically significant difference between the two groups.
In this analysis, we focused on identifying the key features that differentiate high from low ACR values, using the median of all ACR measurements in the EHR as the threshold. By segmenting the ACR data into values above and below the median, we can pinpoint specific factors associated with higher ACR levels, which indicate greater health risk. This split allowed us to apply machine learning feature-importance techniques to identify significant predictors of high ACR values.
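The median split described above can be sketched as follows. This is an illustrative example only: the ACR values are invented, patients without an ACR measurement are assumed to be dropped, and values exactly equal to the median are assumed to fall in the low group, a detail the source does not specify.

```python
import numpy as np

def label_by_median(acr_values):
    """Label each measured ACR value as high (1) or low (0) around the median."""
    acr = np.asarray(acr_values, dtype=float)
    # Assumption: patients with no ACR measurement are excluded from the split.
    measured = ~np.isnan(acr)
    median = float(np.median(acr[measured]))
    # Assumption: values equal to the median are assigned to the low group.
    labels = (acr[measured] > median).astype(int)
    return median, measured, labels

# Invented ACR values; NaN marks a patient without an ACR result.
acr = [5.0, 12.0, float("nan"), 30.0, 250.0, 18.0]
median, measured, labels = label_by_median(acr)
print(median, labels.tolist())
```

The resulting binary labels can then serve as the classification target for a feature-importance analysis, fitted separately for the Hispanic and non-Hispanic groups.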
Understanding these predictors aids in developing targeted interventions and improving clinical outcomes for both Hispanic and non-Hispanic populations. Mean Decrease in Accuracy is useful because it quantifies the importance of each feature within a classification model: features that cause a substantial accuracy drop when altered are deemed crucial, while those with minimal impact are considered less significant. Below is the Mean Decrease in Accuracy evaluation for the Hispanic and non-Hispanic populations. Figure 4 shows the most important blood tests for each population.
Figure 3.
Box plots illustrating significant differences in Albumin/Creatinine Ratio. (a) Demonstrates the statistical difference between the Hispanic and non-Hispanic groups. (b) Highlights notable differences between surviving and deceased individuals within the Hispanic group.
Figure 4.
Top 10 Important Features for ACR in Hispanic and non-Hispanic Groups. This chart highlights the most influential factors for Albumin/Creatinine Ratio (ACR): (a) for Hispanic origin and (b) for non-Hispanic origin.
We observed that the distinguishing tests for the Hispanic group consist of lipid tests (LDL and VLDL cholesterol), metabolism and mineral absorption tests (calcium), white blood cell tests (neutrophils/100 leukocytes), and red blood cell tests (erythrocytes).