1. Introduction
Worldwide, breast cancer remains the leading cause of cancer-related mortality in women, surpassing both lung and skin cancers. The American Society of Clinical Oncology (2024) projects 297,790 new cases of invasive breast cancer and 55,720 cases of non-invasive breast cancer in the United States in 2024 [1]. Additionally, 2,800 cases are expected to be detected in men. Furthermore, more than 3.8 million women are currently living with or have survived this disease. Early identification is essential for successful therapy and for possibly halting disease progression. However, existing techniques such as mammography, which is mostly advised for women between the ages of 40 and 75, have several drawbacks. False positives, a common issue in lung cancer screening, also afflict mammography, resulting in unneeded biopsies, stress, discomfort, and radiation exposure. In addition, excluding younger or older individuals from routine cancer screenings risks missing a significant number of cancer cases. The invasive nature of the biopsies currently employed to detect and analyze tumor biomarkers highlights the critical need for non-invasive methods for early-stage breast cancer detection. Such advancements would facilitate prompt, targeted interventions, potentially mitigating mortality risks associated with late diagnosis [2].
Metabolites, the small biomolecules that act as distinct chemical markers of our metabolic activities, have great potential to enhance the accuracy and precision of breast cancer screening and early diagnosis. Alterations in an individual’s metabolomic profile often arise from a change in a gene, whether a mutation, over-expression, or downregulation, and these changes could eventually facilitate cancer development. Metabolites are also closely linked to the phenotype of an organism, which can have a significant impact on human health. Because breast cancer is a highly complex and heterogeneous disease with an array of clinical presentations and responses to therapy, metabolomic profiling of breast cancer patients offers a robust way of capturing a patient’s phenotype. This makes it especially useful for monitoring individuals at all risk levels, including those who are outside the normal screening age categories. While univariate and multivariate analyses of metabolite datasets from urine samples have shown promise, the need for more robust approaches has been highlighted [3].
By applying machine learning and deep learning algorithms to metabolomics data, it may be possible to build models that identify breast cancer, and potentially different cancer subtypes, even before symptoms manifest. Comprehensive metabolite profiling offers important perspectives on the fundamental processes driving cancer cell growth. Tumor cells exhibit altered metabolic profiles that reflect their increased energy demands, enhanced proliferation, and capacity to avoid programmed cell death. These alterations are observed in the levels of metabolites associated with many processes, including glycolysis, lipid metabolism, and amino acid metabolism. Enhanced glycolysis increases glucose uptake and lactate production to support rapid cell division [4]. Lipid metabolism is modified by the synthesis of fatty acids, which aids the formation of cell membranes and enables cellular communication [5]. Amino acid metabolism is stimulated by heightened utilization of glutamine for anabolic functions and by altered amino acid turnover, such as protein synthesis and degradation [6,7]. By quantitatively measuring these metabolites in blood samples and building diagnostic classifiers, we gain insights into the unique energy demands and vulnerabilities of cancer cells, potentially enabling earlier detection and more targeted treatment strategies compared to traditional approaches.
This study proposes a model that can ultimately pave the way toward the development of non-invasive, routine, low-cost, and reliable diagnostic tests. We explore the utility of diverse machine learning techniques and statistical methods for breast cancer feature selection and classification. The principal aim of this research is to construct robust, parsimonious, and interpretable models for early-stage breast cancer detection utilizing strategically selected, information-rich metabolomics data. This approach has the potential to offer a safer, more personalized, and readily accessible alternative to established screening methods.
3. Discussion
It is apparent that cancer has a genetic component and, in fact, it has been generally accepted as a genetic disease [16]. It is well known that variations in genetic make-up can influence susceptibility to certain diseases, including cancer. Furthermore, epigenetic factors (DNA methylation and histone modification) are considered likely to play important roles in the pathogenesis of cancer. Although a number of blood-based cancer assays that detect protein, microRNA, circulating DNA, and methylated DNA biomarkers have been developed, they are specific to late-stage cancer, and thus their application for screening and/or early detection is rather limited. Furthermore, analytical techniques that require biopsy material for molecular diagnosis are invasive and uncomfortable for the patient and carry a risk of inaccurate interpretation. It is known that metabolites and genes are intimately connected [17]. Indeed, a single DNA base change in a given gene can lead to a 10,000-fold shift in the concentrations of metabolites, which are the products of a sequence of events, i.e., gene transcription, translation, and subsequent protein synthesis and enzyme activation [18,19]. Accordingly, there is an amplification of the signal from DNA to protein to metabolites. It should be mentioned that several factors can affect the metabolome, including ethnicity, sex, age, and diet, as well as geographical location and environment [20]. Therefore, there are specific metabolomic signatures that could constitute a panel of biomarkers with substantial clinical application and significance, not only for diagnostic and prognostic value in cancer but also as a predictive tool for early detection of cancer in high-risk populations. While our work is not meant to de-emphasize the genetic and molecular components of cancer, metabolomic biomarkers are a complementary field that can be utilized to assist existing screening/surveillance technologies [21].
The present study identified a minimal 6-variable panel from a metabolomics dataset that demonstrated high accuracy in breast cancer detection. This finding holds significant promise for the development of non-invasive, accurate, and cost-effective diagnostic tools for breast cancer. The use of a metabolomics approach offers a unique perspective on cancer diagnosis, as it bypasses limitations associated with more invasive and late-stage detection methods. Metabolites represent the downstream products of cellular processes, providing a closer reflection of the functional state of a cell or tissue. In contrast to the established cancer screening methods, this approach has the potential to capture subtle metabolic alterations associated with early-stage cancers, potentially enabling earlier detection and intervention. The employed feature selection strategies enhanced the robustness and generalizability of the findings. By combining three distinct feature selection methods, the study effectively identified the most discriminative and non-redundant biomarkers within the dataset. Our findings underscore the criticality of multi-source data integration, incorporating both demographic and metabolomic profiles, to offer a more holistic perspective on patient health and improve cancer prediction. Metabolomics data presents unique challenges for standard analytical models. The inherent complexity of metabolomic data, with its large number of interconnected variables and often limited sample sizes, presents a significant challenge in identifying the most informative biomarkers. This complexity may explain why, despite widespread use of PLS/regression feature selection in metabolomic studies, there remains limited consensus on reliable metabolomic indicators of breast cancer.
Our research addressed this by systematically comparing multiple feature selection strategies to derive a robust and reliable panel of biomarkers, achieving a perfect AUC and 98% accuracy. Despite these strengths, we share some limitations with other studies, namely a relatively small sample size. Additionally, the study focused solely on diagnostic accuracy. While this is an essential initial step, future research should explore the panel’s potential for risk stratification, treatment response prediction, and early-stage cancer detection. Supervised and unsupervised machine learning (ML) methods have also emerged as potent tools in multi-omics analysis, aiding in the identification of patterns and improved outcomes across diverse biological variables. Numerous studies have demonstrated their efficacy in profiling various omics data sources, including proteomics, genomics, metabolomics, and transcriptomics. Sugimoto et al. [22], for instance, employed various ML classification models, such as Random Forests, Naïve Bayes, and Support Vector Machines, to analyze genomics data and gene assays. Their study, utilizing cross-validation for robust comparisons, highlighted the ability of ML to extract valuable insights from multi-omics data for tasks such as disease classification and biomarker discovery.
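One simple way to combine the outputs of several feature selection methods is a majority vote over the features each method retains. The sketch below assumes that voting rule purely for illustration; it is not necessarily the procedure used in this study, and the method names and feature names (`method_a`, `m1`, and so on) are hypothetical placeholders.

```python
from collections import Counter


def consensus_features(selections, min_votes=2):
    """Return features chosen by at least `min_votes` selection methods.

    `selections` maps a method name to the list of features it retained.
    The result is ordered by vote count (descending), then alphabetically.
    """
    votes = Counter()
    for selected in selections.values():
        # Use a set so each method votes at most once per feature.
        votes.update(set(selected))
    kept = [feature for feature, n in votes.items() if n >= min_votes]
    return sorted(kept, key=lambda feature: (-votes[feature], feature))


# Hypothetical outputs of three selectors; these are not the study's data.
example_selections = {
    "method_a": ["m1", "m2", "m3", "age"],
    "method_b": ["m2", "m3", "m5", "age"],
    "method_c": ["m3", "m4", "age", "m6"],
}

panel = consensus_features(example_selections)
strict_panel = consensus_features(example_selections, min_votes=3)
```

Raising `min_votes` trades panel size for agreement between methods, which is one way to favor a parsimonious, non-redundant panel.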
Other ML algorithms, such as Decision Trees, PCA, t-SNE, and PLS, have also shown promising results in identifying and classifying various cancers based on metabolomic data, including ovarian [23,24,25], lung [26,27,28], endometrial [29], skin and kidney carcinomas [30,31,32], glioma and meningioma brain tumors [33], and non-Hodgkin's lymphoma [34]. For breast cancer in particular, Henneges et al. [35] achieved a sensitivity and specificity of 83.5% and 90.6%, respectively, with an SVM-based metabolomic approach. Using an ensemble-based ADTree model, Mutata et al. [36] reached an accuracy of 91.2% in discriminating between breast cancer patients and the control group. A LASSO regression model, applied to a subset of 22 biomarkers specifically selected from triple-negative breast cancer patients, achieved an overall diagnostic accuracy of 93% and an AUC of 96%. The model also exhibited a high sensitivity of 96% and a specificity of 91%.
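The sensitivity, specificity, and accuracy figures quoted above all derive from the four cells of a confusion matrix. As a minimal sketch, the function below computes them from raw counts; the counts in the example are hypothetical and are not taken from any of the cited studies.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts.

    tp/fn count cancer cases classified correctly/incorrectly;
    tn/fp count controls classified correctly/incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return {
        "sensitivity": tp / (tp + fn),  # fraction of cases detected
        "specificity": tn / (tn + fp),  # fraction of controls cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical cohort: 50 cases and 50 controls.
metrics = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
```

Because accuracy depends on the case-to-control ratio, sensitivity and specificity are usually reported alongside it, as the studies above do.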
Metabolomic profiling of plasma from breast cancer patients has revealed distinct metabolite signatures not only when compared with healthy individuals, but also across various disease stages and demographic profiles. Jasbi et al. [37] found significant variations in the levels of the same metabolites even between early-stage breast cancers (Stages 1 and 2), underscoring the sensitivity of these biomarkers to subtle shifts in tumor metabolism. Their analysis revealed significant differences in the levels of proline, myoinositol, 2-hydroxybenzoic acid, gentisic acid, hypoxanthine, and 2,3-dihydroxybenzoic acid [37]. Including age as a feature in a PLS-DA model alongside the six metabolites enhanced the model’s discriminatory power, achieving an AUC of 89%, with a sensitivity of 80% and a specificity of 75%.
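The AUC values reported throughout this section can be read as the probability that a randomly chosen cancer case receives a higher model score than a randomly chosen control. The sketch below computes the AUC directly from that pairwise definition, counting ties as half; the labels and scores are hypothetical and serve only to illustrate the calculation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    `labels` holds 1 for cases and 0 for controls; `scores` holds the
    model output for each sample (higher means more likely a case).
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0  # case ranked above control
            elif c == k:
                wins += 0.5  # ties count as half
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical scores for three cases and three controls.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(example_labels, example_scores)
```

In this example one case (score 0.4) falls below one control (score 0.7), so eight of the nine case-control pairs are ranked correctly.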
Race also plays a critical role in discriminating between cancer and control cases. Santaliz-Casiano et al. [38] observed 9 metabolomic signatures exclusive to African American patients and 6 others for white individuals. Alpha-ketoisocaproic acid, arginine, alpha-tocopherol, citric acid, histidine, maltose, methionine, N-acetylglutamic acid, O-phosphoethanolamine, and oxalic acid were statistically significant in the African American cohort (AUC of 79%), while β-hydroxybutyrate, cholesterol, oxalic acid, palmitic acid, palmitoleic acid, and tetradecanoic acid were strong indicators of disease in white individuals (AUC of 78%). The distinct signatures of amino acid metabolism in cancerous tumors across various subpopulations and tumor stages suggest that altered metabolic pathways in cancer are not solely driven by tumor biology but may also be influenced by genetic factors such as underlying gene expression, epigenetic modifications caused by diet and lifestyle, and genetic marker variations. This potential dependence highlights the need for stratified approaches to metabolic profiling and biomarker identification that consider these variables to improve the accuracy and generalizability of findings.
Subramani et al. [39] identified a distinct metabolic signature in cancer cells compared to healthy controls, characterized by elevated choline and decreased glucose levels. This dysregulation likely supports the increased energy demands and rapid proliferation of cancer cells. Notably, the study also linked elevated serine levels to cancer cell division through serine’s essential role in nucleic acid synthesis [39]. Expanding on specific metabolite associations, Jobard et al. [40] identified ten plasma metabolites positively associated with breast cancer (BC) risk in premenopausal women. These metabolites include histidine, glycerol, N-acetylcysteine, and ethanol, as well as leucine, ornithine, albumin, pyruvate, glutamate, and glutamine [40]. Higher levels of glutamate in breast cancer patients suggest that it plays an important role in fatty acid overproduction [41,42]. The association of histidine with BC has been corroborated by Huang et al. [43]. Additionally, this team of researchers applied a neural network model to a panel of seven saliva biomarkers to predict the probability of being diagnosed as BrCa-positive and attained an AUC of 86.5%. The 7-metabolite panel consists of L-glyceric acid, nicotinamide, histamine, uracil, thymine, 3,4-dihydroxybenzylamine, and dehydrophenylalanine [43].
Nicotinamide, a water-soluble form of vitamin B3, is overexpressed in triple-negative breast cancer patients. This overexpression was associated with increased lipid metabolism and energy disruption, suggesting its potential as an anti-tumor agent [44]. A similar saliva-based biomarker study found elevated levels of polyamine and spermine in patients with breast cancer [36]. Another metabolic panel associated with an increased risk of developing breast cancer identified high levels of valine/norvaline, glutamine/isoglutamine, 5-aminovaleric acid, phenylalanine, tryptophan, γ-glutamyl-threonine, ATBC, and pregnenetriol sulfate, alongside a concomitant decrease in plasma O-succinyl-homoserine levels, as statistically significant indicators of disease [45].
The selected metabolites included DG(O-16:0/18:0), 1-butylamine, cytidine, histamine, phosphorylcholine, hydroxylinolenic acid, linoleic acid, glycerol 3-phosphate, glutamate, propenoyl carnitine, glutamine, tyrosine, 3-hydroxypalmitic acid, lysoPC(P-16:0), butyryl carnitine, pipecolic acid, lysoPC(18:2), N-acetyl spermidine, lactic acid, histidine, N-methyl histamine, and N-acetyl histamine [46]. Elevated levels of carnitine derivatives such as L-carnitine, acetylcarnitine, acylcarnitine C3:0, acylcarnitine C4:0, acylcarnitine C5:0, and acylcarnitine in murine breast cancer models indicate that these metabolites may be related to the development of breast cancer (Sun et al. [47]).
Another LASSO-based regression model identified a panel of seven metabolites consisting of glutamine, ornithine, threonine, methionine sulfoxide, short-chain acylcarnitine C3, acetylcarnitine C2, and tryptophan, and reached an AUC of 80%. These significantly differentiated metabolites are mainly involved in amino acid metabolism, aminoacyl-tRNA biosynthesis, and nitrogen metabolism [48]. Tryptophan holds particular significance due to its interference with cellular and immune signaling pathways, impacting cell division and suppressing anticancer immune responses [49,50]. Indoleacetylglutamine, a tryptophan derivative, was underexpressed in a breast cancer cohort analyzed by Dougan et al. [51], further supporting tryptophan’s anti-cancer role. In addition to indoleacetylglutamine, this study observed a greater than 20% case-control difference in 23 other metabolites, including 1-(1-enyl-palmitoyl)-2-oleoyl-GPC (P-16:0/18:1), 1-linoleoyl-GPA (18:2), 1-palmitoleoyl-2-linoleoyl-GPC (16:1/18:2), 1-palmitoyl-GPG (16:0), 1-palmitoylglycerol (16:0), 2-ethylphenylsulfate, 3-(cystein-S-yl)acetaminophen, 4-acetylphenol sulfate, adrenate (22:4n6), asparagine, cysteine s-sulfate, cysteinylglycine, ergothioneine, glycerate, glycolithocholate, heptanoate (7:0), indoleacetate, laurylcarnitine, maltotriose, N-(2-furoyl)glycine, sphingosine, sphingosine 1-phosphate, and threonine [51]. While over 100 potential biomarkers have been proposed in the literature reviewed for this paper, few are consistently replicated across multiple studies (Figure 4).
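The limited replication noted above can be illustrated by tallying how often each metabolite recurs across the panels named in this section. The sketch below covers only a subset of the cited panels, transcribed from the text with spellings normalized (an assumption made here for matching); it relies on exact string matching, so synonyms and related compounds such as methionine and methionine sulfoxide are counted separately.

```python
from collections import Counter

# Panels transcribed from the text above (a subset of the cited studies).
# The two cohorts of Santaliz-Casiano et al. are merged into one study.
panels_by_study = {
    "Santaliz-Casiano et al. [38]": {
        "alpha-ketoisocaproic acid", "arginine", "alpha-tocopherol",
        "citric acid", "histidine", "maltose", "methionine",
        "N-acetylglutamic acid", "O-phosphoethanolamine", "oxalic acid",
        "beta-hydroxybutyrate", "cholesterol", "palmitic acid",
        "palmitoleic acid", "tetradecanoic acid",
    },
    "Jobard et al. [40]": {
        "histidine", "glycerol", "N-acetylcysteine", "ethanol", "leucine",
        "ornithine", "albumin", "pyruvate", "glutamate", "glutamine",
    },
    "Huang et al. [43]": {
        "L-glyceric acid", "nicotinamide", "histamine", "uracil", "thymine",
        "3,4-dihydroxybenzylamine", "dehydrophenylalanine",
    },
    "LASSO panel [48]": {
        "glutamine", "ornithine", "threonine", "methionine sulfoxide",
        "acylcarnitine C3", "acetylcarnitine C2", "tryptophan",
    },
}


def replication_counts(panels):
    """Count in how many studies each metabolite appears."""
    counts = Counter()
    for metabolites in panels.values():
        counts.update(metabolites)  # sets: one count per study
    return counts


counts = replication_counts(panels_by_study)
replicated = sorted(m for m, n in counts.items() if n >= 2)
print(replicated)
```

Within this subset, only a handful of the listed metabolites appear in more than one study, which is consistent with the observation that few candidates are replicated.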
Machine learning and deep learning (DL) models often outperform statistical regression in their ability to capture both linear and non-linear relationships within diverse variables. This advantage makes them compelling tools for analyzing biological data, particularly in the context of breast cancer screening and diagnosis. However, benchmarking these models using robust datasets is crucial to ensure consistency and reproducibility of insights gained. Integrated omics analysis, combining data from multiple molecular layers, holds promise for further enhancing robustness and accuracy. Strategies like cross-cohort data aggregation and cross-modality integration, as proposed by Jiang et al. [52], offer exciting avenues for uncovering novel insights into disease progression. Yet, data availability remains a significant bottleneck in implementing these approaches effectively. The potential of deep learning models in cancer diagnosis is rapidly expanding, particularly in analyzing imaging data. These models can assist pathologists in accurately predicting and labeling cancer samples, ultimately aiding oncologists in making informed treatment decisions. For instance, a recent study by Liu and Li [53] demonstrated the effectiveness of DL models in lung cancer imaging analysis. Their findings highlight the significant role these models can play in assisting pathologists to screen and accurately detect cancerous tumors. Moreover, DL's potential extends beyond single datasets.
Prescreening diverse data sets like omics, X-rays, magnetic resonance imaging, and biopsies holds exciting possibilities for DL models. Gonzales Martinez and van Dongen [54] explored this concept in their study on breast cancer, comparing machine learning and DL approaches in predicting the disease using prescreening information. Their DL model achieved an AUC of 87%, demonstrating its promising potential [54]. On the other hand, Sultana and Jilani [55] found that simple logistic regression emerged as the most powerful predictor, delivering highly accurate results. While alternative methods like K-Nearest Neighbors and instance-based classifiers were employed, none surpassed the effectiveness of logistic regression in this specific context. This disparity highlights the need for robust and diverse training and validation datasets across multiple domains, such as combining metabolites with imaging data. Nonetheless, these combined efforts highlight the significant promise of DL and ML for cancer diagnosis and prediction [56].
While further research is necessary to refine and validate these models, their ability to leverage diverse data sources and achieve high accuracy positions them as powerful tools for improving patient outcomes in the fight against cancer. Despite extensive research efforts devoted to the development of early diagnosis tools for breast cancer, the field currently lacks widely adopted, rapid, and non-invasive diagnostic methods readily implementable in routine clinical practice. To address this unmet need, the findings of the present paper demonstrate the potential of a machine learning model capable of achieving robust and accurate breast cancer detection using only 4 blood biomarkers and 2 demographic characteristics.