1. Introduction
Cancer is a disease that can start almost anywhere in the human body, in which some of the body's trillions of cells grow uncontrollably and spread to other parts of the body. There are over 200 types of cancer, such as colon, liver, ovarian, and breast cancer [1,2]. In 2023, 1,958,310 new cancer cases and 609,820 cancer deaths were projected in the United States [3]. This calls for a clear understanding of the underlying mechanisms and characteristics of this potentially fatal disease, alongside identification of the most significant genes responsible for it.
Cancer can alter the gene expression profile of the body's cells. Therefore, microarray data are utilized in clinical diagnosis to recognize up- or down-regulated gene expression, which underlies the generation of new biomarkers and can lead to cancer [4]. Microarray data analysis has been a popular approach for diagnosing cancer, and the DNA microarray is a technology used to collect data on large numbers of gene expression values at the same time [5,6]. The classification and identification of gene expression using DNA microarray data is an effective tool for cancer diagnosis and prognosis for specific cancer subtypes. Gene expression analysis can tell medical experts whether a patient suffers from cancer within a relatively shorter time than traditional methods. Recently, such analysis has emerged as an important means of addressing the fundamental challenges associated with cancer diagnosis and drug discovery [7,8]. Analysis of gene expression data involves the identification of informative genes; references [9] and [10] demonstrate that cancer classification can be improved by identifying informative genes, which in turn can be used to accurately predict the classes of new samples.
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to “self-learn,” that is, to obtain information from training data, recognize patterns in the data, and develop their own predictions, improving over time without being explicitly programmed [11]. Medical researchers and clinicians are utilizing several ML techniques on medical data sets to construct intelligent diagnosis systems [12]. Massive volumes of data are being generated in the medical industry thanks to the digital revolution in information technology. ML techniques are highly suited to analyzing these massive data sets, and multiple algorithms have been used to diagnose various diseases [13,14,15]. Numerous studies have been conducted to classify cancer using microarray gene expression data. Golub et al. [16] suggested a strategy based on expression profiles generated by microarrays. According to ML theory, classification outcomes depend on the features of the input set, the training algorithm, and the system's capacity to adapt to the original data. It is therefore necessary to evaluate the behavior of various classifiers on the data at hand.
Recently, several classification approaches have been created in the ML domain, and many of them have been utilized in cancer classification [17]. However, the microarray classification process faces several difficulties: (a) microarray gene expression data contain many highly correlated genes but only a small number of samples, and the small number of cancer samples compared with the number of features can degrade classifier performance and increase the risk of over-fitting; (b) various uncertainties are associated with the process of acquiring microarray data (for example, fabrication and image processing), resulting in unexplained fluctuation in the data; and (c) the majority of genes in the microarray data are redundant for classifying diverse tissue types [18,19].
Early detection of cancer is among the most effective approaches to reducing cancer-related deaths [20,21,22,23]. The microarray's primary characteristic is its greater number of genes (p) in comparison to the number of tissues (n) [24]. In most gene expression studies, selecting relevant genes to differentiate between patients with and without cancer is a common task [25,26,27,28,29,30]. Due to overestimation and various linearity issues, it is difficult to classify high-dimensional microarray data (p > n) using classical statistical approaches [31,32]. There is no single optimal method for examining microarray data, and its analysis methods continue to evolve [33]. Various supervised and unsupervised ML techniques have also been adopted to identify the most significant genes [34,35,36]. In microarray gene expression analysis, gene selection or feature selection (FS) is utilized to improve cancer classification performance while using fewer samples, eliminate unwanted and redundant attributes from the data, and ultimately counter the curse of dimensionality by identifying the most informative genes, thereby enhancing disease prediction accuracy [37,38]. ML and dimensionality reduction techniques also perform exceptionally well at classifying biological data [39,40,41]. Hence, it may be beneficial to use feature selection methods that can address the challenges arising from high data dimensionality and small sample size.
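To make this idea concrete, the following is a minimal sketch of a filter-based feature (gene) selection step using a one-way ANOVA F-test, written in Python with scikit-learn. The synthetic data, the choice of k = 50 retained genes, and all names are illustrative assumptions for this sketch and are not part of the present study.

```python
# A minimal sketch of filter-based gene selection (assumed setup, not the
# paper's actual pipeline): X is a samples-by-genes expression matrix and
# y holds the class labels. Both are synthetic placeholders here.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # 100 samples, 5000 genes (synthetic)
y = rng.integers(0, 2, size=100)   # binary class labels (synthetic)

# Keep the k genes with the largest between-class ANOVA F-statistic.
selector = SelectKBest(score_func=f_classif, k=50)
X_reduced = selector.fit_transform(X, y)

selected_gene_indices = selector.get_support(indices=True)
print(X_reduced.shape)             # (100, 50)
print(selected_gene_indices[:10])  # indices of the retained genes
```

Other filter, wrapper, or embedded methods discussed in the literature cited above could be substituted for the F-test score without changing the overall structure.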
The remainder of this paper is structured as follows. Section 2 discusses the related work. Section 3 presents the materials and methods. Section 4 presents the experimental results. Finally, Section 5 concludes the paper with a discussion.
2. Related Work
ML can assist in automating intelligent processes, increasing development efficiency and accuracy, and lowering costs [42]. Over the years, ML-based classifiers have been widely used in the classification of cancer sub-types. Several studies have assessed whether ML can help in oncology care by investigating the applications of ML in cancer risk stratification, diagnosis, and medication development [17,43,44,45]. According to those studies, ML can aid cancer prediction and diagnosis by analyzing pathology profiles and imaging studies.
BRCA (Breast Cancer gene) genes produce proteins that help repair damaged DNA and are referred to as tumor suppressor genes, since certain changes in these genes can cause cancer [46]. People born with certain variants of BRCA tend to develop cancer at earlier ages. Chang, Dalpatadu, Phanord, and Singh [47] fitted a Bayesian logistic regression model for the prediction of breast cancer using the Wisconsin Diagnosis Breast Cancer (WDBC) data set [48], downloaded from the UCI Machine Learning Repository; precision, recall, and F1-measures of 0.93, 0.89, and 0.91 were reported for the training data, and 0.87, 0.91, and 0.89 for the test data, respectively. The HER2 protein accelerates breast cancer cell growth, and HER2-positive patients are treated with medicines that attack the HER2 protein. Gene expression patterns of HER2 are quite complex and pose a challenge to pathologists. Cordova et al. (2023) developed a new interpretable ML method in immunohistochemistry for accurate HER2 classification and obtained high precision (0.97) and high accuracy (0.89) using immunohistochemistry (IHC) and fluorescence in situ hybridization (FISH) data [49].
Kidney renal cell carcinoma (KIRC) is the most prevalent type of kidney cancer, with a survival of less than 5 years and an estimated 338,000 new cases each year [50]. Wang et al. (2023) correlated the immunogenic cell death (ICD) profile of KIRC with its heterogeneity and therapeutic complexity, which is useful for developing optimal immunotherapy strategies for KIRC patients [51].
Colon adenocarcinoma (COAD), a common cancerous tumor of the digestive tract, is commonly associated with fatty acids [52]; diagnosis of COAD is difficult because there are hardly any early symptoms. Li et al. (2017) used a genetic algorithm and the k-nearest neighbors clustering method to determine genes that can accurately classify samples as well as class subtypes for a TCGA RNA-seq dataset of 9066 cancer patients and 602 normal samples [53].
Lung adenocarcinoma (LUAD) is a common form of lung cancer that is often detected only in the middle or late stages and is therefore hard to treat [54]. Yang et al. (2022) used a dataset of gene expression profiles from 515 tumor samples and 59 normal tissues and split the dataset into two significantly different clusters; they further showed that using age, gender, pathological stage, and risk score as predictors of LUAD increased the prediction accuracy measures [55]. Liu, Lei, Zhang, and Wang (2022) used cluster analysis on enrichment scores of 12 stemness signatures to identify three LUAD subtypes, St-H, St-M, and St-L, across six different datasets [56].
Prostate adenocarcinoma (PRAD) is common in elderly men, and patients suffering from PRAD typically have a good prognosis [57]. Khosravi et al. (2021) used deep learning models on an MRI dataset from 400 subjects with suspected prostate cancer, combined with histological data, and reported high accuracies [58].
Principal component analysis (PCA) is an exploratory multivariate statistical technique for simplifying complex data sets [59,60,61]. It has been used in a wide range of biomedical problems, including the analysis of microarray data in search of outlier genes [62], the analysis of other types of expression data [63,64], and cancer classification [65]. Oladejo, Oladele, and Saheed (2018) presented two methods of dimensionality reduction, feature extraction (FE) and FS; one-way ANOVA was utilized for FE and PCA for FS [66]. The support vector machine (SVM) and k-nearest neighbor (K-NN) classifiers were used for the classification of leukemia genome data, and the obtained results gave an accuracy of 90% for SVM and 81.67% for K-NN.
Adebiyi, Arowolo, Mshelia, and Olugbara (2022) applied the random forest (RF) and SVM machine learning algorithms with the feature extraction method of linear discriminant analysis (LDA) to the Wisconsin Breast Cancer Dataset [67]. The SVM with LDA and the RF with LDA yielded accuracies of 96.4% and 95.6%, respectively. Evidence from this study shows that better prediction is crucial and can benefit from machine learning methods, and, compared with the existing literature, the research validated the use of feature extraction in building a diagnostic system for breast cancer.
Ak (2020) utilized the Wisconsin Breast Cancer Dataset [48] for a comparison of most of the major machine-learning procedures for detection and diagnosis [69]. Supervised learning methods (decision tree, RF, multilayer perceptron, SVM, and linear regression (LR)) were compared in both the classification and regression categories. The results revealed that, among the classification algorithms, the SVM provides high accuracy, whereas under the regression methodology multilayer perceptron regression delivers reduced errors. Díaz-Uriarte (2006) investigated the use of RF for the classification of microarray data (including multi-class problems) and proposed a new method of gene selection in classification problems based on RF [70]. The study used simulated data and nine microarray data sets and demonstrated that RF has comparable performance to other classification methods, including diagonal linear discriminant analysis (DLDA), K-NN, and SVM, and that the new gene selection procedure yields very small sets of genes without compromising predictive accuracy.
Tan and Gilbert (2003) classified cancer from gene expression data using three distinct tree-based supervised ML techniques [71]. Seven different categories of cancer data were classified using bagged and boosted decision trees (DT) alongside the C4.5 DT, and the bagged DT outperformed the other two. Sharma, Imoto, Miyano, and Sharma (2012) proposed a null space-based feature selection method for gene expression data for supervised classification [72]. Null space information generated from scatter matrices was utilized for feature selection to remove redundant gene expressions. After effectively lowering the dimension of the features, classification was performed using three different types of classifiers: SVM, naïve Bayes (NB), and LDA.
Degroeve, De Baets, Van de Peer, and Rouzé (2002) created a balanced training set by randomly selecting 1000 positive and 1000 negative instances, along with one test set of 281 positive and 7505 negative instances and another of 281 positive and 7643 negative instances; they used an SVM classifier, an NB classifier, and a traditional feature selection method for predicting splice sites and obtained improved performance. Precision for these datasets was in the 93-98% range, but recall and F1-measures were in the 25-49% range [73]. Peng, Li, and Liu (2006) compared various methods of gene selection over four microarray gene expression datasets and showed that the hybrid method works well on all four [74].
Sharma and Paliwal (2008) used the gradient LDA method on three small microarray gene expression datasets (acute leukemia, small round blue-cell tumor (SRBCT), and lung adenocarcinoma) and obtained higher accuracies than some competing methods [75]. Bar-Joseph, Gitter, and Simon (2012) discussed how time-series gene expression data are used to identify activated genes in biological processes and described how basic patterns lead to gene expression programs [76]. Cho et al. (2004) proposed a modified kernel Fisher discriminant analysis (KFDA) for the analysis of a hereditary breast cancer dataset [77]; the KFDA classifier employed the mean squared error as the gene selection criterion. Huang (2009) evaluated the classification performance of LDA, prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA), and shrinkage diagonal discriminant analysis (SDDA) by applying these methods to six public cancer gene expression datasets [78].
Dwivedi (2018) used an artificial neural network (ANN) to classify acute lymphoblastic leukemia and acute myeloid leukemia and reported over 98% overall classification accuracy [79]. Sun et al. (2019) used a genome deep learning method to analyze 6,083 samples of whole-exome sequencing mutations covering 12 types of cancer, together with 1,991 non-cancerous samples from the 1000 Genomes Project, and obtained overall classification accuracies ranging from 70% to 97% [80]. Alhenawi, Al-Sayyed, Hudaib, and Mirjalili (2022) conducted a survey of the feature selection literature for gene expression microarray data analysis based on a total of 132 research articles [81]. Khatun et al. (2023) developed an ensemble rank-based feature selection method (EFSM) and a weighted average voting scheme to overcome the problems posed by the high dimensionality of microarray gene expression data [82]; they obtained overall classification accuracies of 100% (leukemia), 95% (colon cancer), and 94.3% (11-tumor dataset). Osama, Shaban, and Ali (2023) reviewed ML methods for cancer classification of microarray gene expression data, covering data pre-processing and feature selection methods including filter, wrapper, embedded, ensemble, and hybrid algorithms [83].
Kabir et al. (2023) compared two different dimensionality reduction techniques, PCA and autoencoders, for the selection of features in a prostate cancer classification analysis; two machine learning methods, neural networks and SVM, were then used for classification, and the study showed that the classifiers performed better on the reduced dataset [84]. Another study, Adiwijaya et al. (2018), utilized a PCA dimensionality reduction method that includes the calculation of the proportion of variance for eigenvector selection, followed by classification with SVM and the Levenberg-Marquardt backpropagation (LMBP) algorithm; based on the tests performed, classification using LMBP was more stable than SVM [10].
Kharya, Dubey, and Soni (2013) compared the accuracy of the SVM, ANN, naive Bayes classifier, and AdaBoost tree in an observational study to identify a potent model for breast cancer prediction [85]. PCA was used to reduce dimensionality. The study found that, compared with techniques such as decision trees and regression trees, the ANN was the most reliable approach for making real-time predictions and prognoses. Rana et al. (2015) used machine learning classification algorithms, which learn from stored historical data to forecast the categories of new inputs, namely benign and malignant tumors [86]. In that study, the random forest model demonstrated the highest accuracy, 96%, in detecting different cancers.
Based on previous research, the general scheme for classifying microarray data for cancer detection consists of preprocessing the data and reducing its dimensionality, followed by cancer classification; a minimal sketch of such a pipeline is given below.
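The sketch chains standardization, PCA, and an RF classifier in a single scikit-learn pipeline. The synthetic data, the choice of two components, and the particular classifier are assumptions made only for this example; they do not reproduce any of the cited studies.

```python
# A hedged sketch of the general scheme: preprocessing -> dimensionality
# reduction -> classification. All data below are synthetic placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2000))   # synthetic expression data (samples x genes)
y = rng.integers(0, 5, size=200)   # five synthetic cancer sub-type labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # preprocessing
    ("pca", PCA(n_components=2)),                     # dimensionality reduction
    ("rf", RandomForestClassifier(random_state=1)),   # classification
])
pipe.fit(X_train, y_train)
print("test accuracy:", pipe.score(X_test, y_test))
```

Any of the classifiers discussed above (LDA, SVM, K-NN, and so on) could be substituted for the random forest step without changing the overall structure.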
4. Results
PCA was run on the entire data set of 801 rows × 20,531 genes, and trial and error showed that just the first two principal components were sufficient for classification purposes. The genes with the highest absolute loadings are shown in Table 9; a sketch of this computation is given below.
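The following is a hedged sketch, using scikit-learn, of one way such PC scores and top-absolute-loading genes could be obtained; the synthetic expression matrix and placeholder gene names stand in for the actual 801 × 20,531 data set, and the code is not the exact procedure used in this study.

```python
# Fit PCA with two components and rank genes by absolute loading on each PC.
# X is a synthetic stand-in for the samples-by-genes expression matrix.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(801, 500)),
                 columns=[f"gene_{i}" for i in range(500)])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # PC1/PC2 scores of the kind plotted in Figure 1
loadings = pd.DataFrame(pca.components_.T,
                        index=X.columns, columns=["PC1", "PC2"])

# Genes with the highest absolute loadings on each component (cf. Table 9).
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(10)
top_pc2 = loadings["PC2"].abs().sort_values(ascending=False).head(10)
print(scores.shape)
print(top_pc1, top_pc2, sep="\n")
```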
A scatterplot of the first two PC scores for the entire dataset is shown in Figure 1, with a clear separation between the BRCA and KIRC cancer sub-types and some overlap among COAD, LUAD, and PRAD.
Figure 1.
Scatterplot of PC2 vs PC1 for the Entire Data.
Accuracy Measures for the LDA Classifier for Training Data
Figure 2.
Confusion Matrix Plot for the LDA Classifier – Training Data.
Table 1.
Precision, Recall, F1 and AUC Measures for the LDA Classifier – Training Data.
Table 2.
Macro- and Micro-Averaged AUC Measures for the LDA Classifier – Training Data.
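For readers who wish to reproduce this type of evaluation, the sketch below fits an LDA classifier on two PC scores and computes a confusion matrix, per-class precision, recall, and F1, and one-vs-rest macro- and micro-averaged AUC, in the spirit of Figure 2 and Tables 1 and 2. The synthetic scores and labels are placeholders, and this is not the paper's actual code; the same calls, applied to held-out data or to a random forest estimator, would yield measures of the kind reported for the test data and the RF classifier below.

```python
# A hedged sketch: LDA fit on synthetic PC1/PC2 scores, then the usual
# multi-class accuracy measures (confusion matrix, precision/recall/F1,
# one-vs-rest macro- and micro-averaged AUC).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(3)
scores_train = rng.normal(size=(600, 2))                    # synthetic PC scores
classes = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]
y_train = rng.choice(classes, size=600)                     # synthetic labels

lda = LinearDiscriminantAnalysis().fit(scores_train, y_train)
y_pred = lda.predict(scores_train)
y_prob = lda.predict_proba(scores_train)                    # columns follow lda.classes_

print(confusion_matrix(y_train, y_pred, labels=lda.classes_))       # cf. Figure 2
print(classification_report(y_train, y_pred, labels=lda.classes_))  # cf. Table 1

# One-vs-rest macro- and micro-averaged AUC (cf. Table 2).
y_bin = label_binarize(y_train, classes=lda.classes_)
print("macro AUC:", roc_auc_score(y_bin, y_prob, average="macro"))
print("micro AUC:", roc_auc_score(y_bin, y_prob, average="micro"))
```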
Accuracy Measures for the LDA Classifier for Test Data
Figure 3.
Confusion Matrix Plot for the LDA Classifier – Test Data.
Table 3.
Precision, Recall, F1 and AUC Measures for the LDA Classifier – Test Data.
Table 4.
Macro- and Micro-Averaged AUC Measures for the LDA Classifier – Test Data.
Accuracy Measures for the RF Classifier for Training Data
Figure 4.
Confusion Matrix Plot for the RF Classifier – Training Data.
Table 5.
Precision, Recall, F1 and AUC Measures for the RF Classifier – Training Data.
Table 6.
Macro- and Micro-Averaged AUC Measures for the RF Classifier – Training Data.
Accuracy Measures for the RF Classifier for Test Data
Figure 5.
Confusion Matrix Plot for the RF Classifier – Test Data.
Table 7.
Precision, Recall, F1 and AUC Measures for the RF Classifier – Test Data.
Table 8.
Macro- and Micro-Averaged AUC Measures for the RF Classifier – Test Data.
In Table 9 we provide the variables (genes) with the highest absolute loadings on the first two PC scores; such a table can be very useful for the selection of features (genes), as sketched after the table below.
Table 9.
Significant genes with highest absolute loadings on the first two PC-scores.
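As a final illustration, the sketch below uses Table 9-style loadings for gene selection by keeping the genes with the largest absolute loadings on PC1 and PC2 and training a classifier on that reduced set. The synthetic data, the placeholder gene names, and the cut-off of 20 genes per component are assumptions made only for this example.

```python
# A hedged sketch of loading-based gene selection followed by classification.
# All data, gene names, and cut-offs are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(300, 1000)),
                 columns=[f"gene_{i}" for i in range(1000)])
y = rng.choice(["BRCA", "COAD", "KIRC", "LUAD", "PRAD"], size=300)

pca = PCA(n_components=2).fit(X)
loadings = pd.DataFrame(pca.components_.T, index=X.columns,
                        columns=["PC1", "PC2"])

# Union of the 20 genes with the largest |loading| on each component.
keep = set()
for pc in ["PC1", "PC2"]:
    keep |= set(loadings[pc].abs().nlargest(20).index)

X_selected = X[sorted(keep)]
rf = RandomForestClassifier(random_state=4)
print("CV accuracy on selected genes:",
      cross_val_score(rf, X_selected, y, cv=5).mean())
```

In practice, genes selected in this way would be validated on held-out data, in line with the training/test evaluation reported above.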