3.1. Characterization of the Urinary Volatilome
The volatile composition of the 132 urine samples was analysed using HS-SPME/GC-MS. Supplementary Figure S1 shows typical chromatographic profiles obtained for the control (CTRL), COVID-19 patient (COVID), and recovered subject (RECOV) groups. Overall, more peaks, and peaks of higher intensity, were observed in the chromatographic profile of the COVID group than in that of the CTRL group (Figure S1, Supplementary material). Furthermore, the number and intensity of peaks in the RECOV group were intermediate between those of the COVID and CTRL groups.
A total of 101 VOMs belonging to 13 chemical families were identified. The VOMs detected in the analysed samples, their frequency of occurrence, and their mean relative peak areas are listed in Supplementary Table S1. As shown in Figure 3, terpenes, phenolic compounds, norisoprenoids, and ketones were the main contributors to the urinary volatile profiles of the studied groups. However, the levels of these and of less represented chemical families varied significantly among the groups: COVID-19 patients showed decreased levels of terpenes, phenolic compounds, benzene derivatives, hydrocarbons, and aldehydes compared with control subjects, and increased levels of norisoprenoids, ketones, alcohols, furans, sulfur compounds, and naphthalene derivatives. For most chemical families in the RECOV group, the sum of the average peak areas was similar to that of the CTRL group, except for benzene derivatives, terpenes, and phenolic compounds, whose levels were similar to those of the COVID group. It should also be highlighted that the variations in the relative levels of the different chemical families are largely caused by an increase or decrease in the levels of the same VOMs, as the number of VOMs identified in each chemical family does not vary significantly among the groups. The only exception is the ketones, with 10, 9, and 13 ketones identified in the CTRL, COVID, and RECOV groups, respectively.
Volatile organic metabolites can have different origins. They can be endogenous, arising from bacterial activity or pH changes, or can be products of metabolic pathways or oxidative stress. They can also be influenced by external factors, including health status, diet, habits, physical stress, and environmental exposure. The human metabolome is therefore highly complex, and it is difficult to determine whether an increase or decrease in certain metabolites is related to a specific disease.
This is why it is crucial to establish a relationship between the identified VOMs and their potential endogenous origin; however, the origin of many VOMs has not been clearly defined.
Figure 4 shows the metabolic pathways responsible for the origin of some chemical groups of endogenous VOMs.
3.2. Chemometric Analysis of Urine Samples
A data matrix of the relative peak areas of the 101 VOMs identified in the three groups under study (COVID, RECOV, and CTRL; Supplementary Table S1) was processed using the MetaboAnalyst software package [31]. Only VOMs with a frequency of occurrence (FO) higher than 80% in the volatile composition of urine were considered. To obtain a consistent distribution without redundant values, the variables were normalised, and univariate analysis was performed using a t-test (p < 0.05). Consequently, 17 VOMs with non-significant contributions to the statistical analysis were removed from the data matrix.
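The two filtering steps described above (frequency of occurrence, then a univariate t-test) can be illustrated with a minimal sketch. It is not the MetaboAnalyst implementation: the function name `filter_voms`, the use of a zero peak area to mean "not detected", and the choice of Welch's unequal-variance t-test are assumptions made for illustration, and the normalisation step is omitted.

```python
import numpy as np
from scipy import stats


def filter_voms(areas, groups, fo_threshold=0.80, alpha=0.05):
    """Return a boolean mask of the VOM columns that pass both filters.

    areas  : (n_samples, n_voms) relative peak areas; 0 is assumed to
             mean "not detected" (an assumption of this sketch)
    groups : (n_samples,) labels containing exactly two distinct values
    """
    areas = np.asarray(areas, dtype=float)
    groups = np.asarray(groups)

    # Frequency of occurrence: fraction of samples in which a VOM was detected.
    fo = (areas > 0).mean(axis=0)
    keep_fo = fo > fo_threshold

    # Column-wise two-sample t-test between the two groups.
    # equal_var=False selects Welch's test (assumed here, not stated in the text).
    a, b = np.unique(groups)
    _, p = stats.ttest_ind(
        areas[groups == a], areas[groups == b], axis=0, equal_var=False
    )

    # A NaN p-value (e.g. a constant column) compares as False and is dropped.
    return keep_fo & (p < alpha)
```

Applying the returned mask to the columns of the data matrix gives the reduced matrix that would be passed on to the multivariate analysis.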
In multivariate pattern recognition procedures such as partial least squares discriminant analysis (PLS-DA), the information contained in the VOM fingerprint is used as multiple variables to visualize group trends and clusters. For the COVID-CTRL comparison, PLS-DA revealed two distinct, well-separated groups. To evaluate the robustness of the model, 10-fold cross-validation of the PLS-DA model was performed (Figure 5).
Figure 5c shows R2 and Q2 values close to 1, which represent, respectively, the goodness of fit and the predictive ability of the model for distinguishing between the study groups. The asterisk (*) marks the best Q2 value for the PLS-DA model. A random permutation test with 1000 permutations was carried out to assess the statistical significance of the class discrimination obtained (Figure 5d). Additionally, the variable importance in projection (VIP) score plot of the top 10 variables (VIP > 1) is reported (Figure 5b); it illustrates the relative contributions of the metabolites in explaining the variance observed between the COVID and CTRL groups. 1,1,6-Trimethyl-dihydronaphthalene (TDN) and 2-heptanone made the largest contributions to the discrimination of the COVID group, whereas D-carvone and 3-methoxy-5-(trifluoromethyl)aniline (MTA) made the largest contributions to the discrimination of the CTRL group.
The same multivariate analysis was performed to compare urine samples from SARS-CoV-2-infected patients with those from subjects recovered from COVID-19 (Figure 6). In this case too, PLS-DA segregated the COVID and RECOV samples into two well-separated clusters corresponding to the infected and recovering patients (Figure 6a). The 10-fold cross-validation performance and the permutation test indicated good robustness of the PLS-DA model (Figure 6c,d). In Figure 6b, the VIP score plot shows that β-damascenone and α-isophorone made the largest contributions to discriminating the COVID group, whereas nonanoic acid and α-terpinene made the largest contributions to discriminating the RECOV group.
Hierarchical clustering analysis of the volatilomic data was carried out for the two comparisons, COVID-CTRL and COVID-RECOV, using heatmaps and dendrograms. The heatmaps were created using Spearman's rank correlation as the distance measure to build a visual representation of the data set, focusing on the 15 metabolites most relevant for discriminating the two groups in each comparison. A heatmap allows an intuitive description of the relationship between samples and detected volatile metabolites. The colour of each cell corresponds to the concentration of the detected VOM in each sample (dark blue, less concentrated; dark red, more concentrated).
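As a rough illustration of clustering samples with a Spearman-based distance, the sketch below converts the rank correlation between sample profiles into a distance (1 - rho) and builds a dendrogram with SciPy. The function name `spearman_clusters` is hypothetical, and average linkage is an assumption, since the linkage method is not stated in the text.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def spearman_clusters(X, n_clusters=2, method="average"):
    """Cluster samples (rows of X) using 1 - Spearman correlation as distance.

    X : (n_samples, n_voms) matrix restricted to the selected VOMs
    Returns the flat cluster labels and the linkage matrix for a dendrogram.
    """
    X = np.asarray(X, dtype=float)
    # Spearman correlation is the Pearson correlation of the ranks.
    ranks = stats.rankdata(X, axis=1)
    rho = np.corrcoef(ranks)
    dist = 1.0 - rho
    # Remove floating-point asymmetry and force an exact zero diagonal.
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, Z
```

The returned linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting, and comparing the flat labels with the known group of each sample shows how completely the dendrogram separates the groups.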
For the comparison between the COVID and CTRL groups, the analysis revealed two well-defined clusters (Figure 7a). The urinary VOMs 2-methoxythiophene, toluene, α-isophorone, TDN, and hemimellitene showed a higher correlation with the urinary profile of COVID-19 patients, whereas piperitone, β-ionone, D-carvone, and eudalene were more closely related to the urinary profile of the CTRL group. The dendrogram split the samples completely into two groups matching the actual study groups (Figure 7b). Although the heatmap was also able to cluster the volatilomic data from COVID-19 patients and recovered individuals, the clustering was visually less accurate than in the first analysis (COVID-CTRL), indicating that the urinary profile of COVID-19 patients is closer to that of recovered individuals. Urinary VOMs such as hemimellitene, furan, β-damascenone, and α-isophorone showed a higher correlation with the COVID group, while 2,4-dimethylbenzaldehyde, nonanoic acid, 1-methylcycloheptene, and α-terpinene were more closely related to the volatile profile of recovered individuals (Figure 7c). The dendrogram only partially separated the samples of these two groups (Figure 7d).
To assess the classification of true positives and false positives and the predictive ability of the models, multivariate exploratory receiver operating characteristic (ROC) curves were generated using Monte Carlo cross-validation (MCCV). In each MCCV iteration, feature importance was evaluated using 2/3 of the samples; the most important features were then used to construct classification models, which were validated on the remaining 1/3 of the samples. This process was repeated several times to determine the performance of each model and to calculate confidence intervals. Models were built with the top 3, 5, 10, 20, 30, and 61 most important features, and the resulting curves are reported (Figure 8a,c).
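The MCCV scheme described above can be sketched as repeated stratified 2/3 to 1/3 splits in which features are ranked on the training portion only and the model is scored on the held-out portion. This is a sketch under stated assumptions rather than the MetaboAnalyst procedure: the function name `mccv_auc` is hypothetical, features are ranked by absolute Welch t-statistic, the classifier is a logistic regression, and the interval is a simple percentile interval over the splits.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mccv_auc(X, y, n_features, n_splits=50, seed=0):
    """Mean held-out AUC and 95% percentile interval over MCCV splits.

    X : (n_samples, n_voms) data matrix
    y : (n_samples,) class labels coded as 0 and 1
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=1 / 3, random_state=seed
    )
    aucs = []
    for train, test in splitter.split(X, y):
        X_tr, y_tr = X[train], y[train]
        # Rank features using the training samples only, to avoid leakage.
        t, _ = stats.ttest_ind(
            X_tr[y_tr == 1], X_tr[y_tr == 0], axis=0, equal_var=False
        )
        top = np.argsort(-np.abs(t))[:n_features]
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_tr[:, top], y_tr)
        # Probability of class 1 for the held-out third of the samples.
        scores = model.predict_proba(X[test][:, top])[:, 1]
        aucs.append(roc_auc_score(y[test], scores))
    aucs = np.array(aucs)
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return aucs.mean(), (lo, hi)
```

Calling the function once per feature-set size (for example 3, 5, 10, 20, 30, and 61) gives one AUC estimate per model, which is the kind of comparison summarised by a family of ROC curves.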
Figure 8a displays the ROC curves for the different sets of important features in the COVID-CTRL comparison (COVID-19 patients and control subjects). The area under the curve (AUC) values, ranging from 0.988 to 1, indicate excellent discrimination between the two groups. Figure 8c shows the ROC curves for the COVID-RECOV comparison (COVID-19 patients and subjects in the recovery period). In this case, the AUC values ranged from 0.937 to 0.987, also indicating excellent ability to discriminate between the study groups. The AUC values were calculated with 95% confidence intervals, supporting the reliability of the results.
Figure 8b,d illustrates the predictive accuracy of the biomarker models as the number of features increases. As more features are included in the models, the predictive accuracy improves, which suggests that the selected features contribute to the differentiation between the CTRL and COVID groups and between the COVID and RECOV groups. The predicted class probabilities obtained from the classification models are shown for the COVID-CTRL (Figure 8e) and COVID-RECOV (Figure 8f) comparisons. Overall, the results demonstrate promising performance of the biomarker models, with high accuracy in distinguishing between the groups in each comparison.
Boxplots of the most important variables (VOMs) for discriminating between the COVID and CTRL groups, and between the COVID and RECOV groups, are shown in Figure 9.
Some VOMs, such as D-carvone, MTA, TDN, and α-terpinene, are associated with diet [13,14,15,16,32,33]; as a result, their interpretation as potential biomarkers for COVID-19 infection and progression is not straightforward.