1. Introduction
Breast cancer is the most common cancer in women [1]. In 2020, approximately 2.3 million women were diagnosed with breast cancer, accounting for 11.7% of new cancer cases. Breast cancer has not only become the leading cause of global cancer incidence but is also the fifth leading cause of cancer deaths worldwide, accounting for 1 in 6 cancer deaths [2,3]. Moreover, the worldwide incidence of breast cancer is predicted to keep rising, with approximately 3.2 million new cases of female breast cancer diagnosed per year by 2050. These numbers indicate the urgent need for prevention and treatment strategies for breast cancer. Breast cancer commonly arises in the ducts or lobules. In addition to invading the organ of origin (the breast), malignant breast cancer can metastasize to distant organs such as the bones, lungs, liver, and brain [4], which can lead to disease progression and, in severe cases, death. Therefore, researchers continue to search for breakthroughs in the diagnosis, treatment, and palliative care of breast cancer. In palliative care especially, reliable and accurate prognostic prediction plays a key role in decisions about medical strategies [5].
Medical treatments should be chosen based on the patient's goals and expected survival time, the potential benefits and risks of treatment, and the effects on quality of life; comprehensive consideration of these factors determines treatment choices [6]. To predict patient survival time, many features, including pathogenesis, gene mutation, gene expression, clinical data, treatment, and general health, are typically considered [7,8]. Multiple predictors are therefore used in model design and data analysis to determine the important features of a prognostic model. To date, researchers have proposed different combinations of predictors for survival analysis, for death probability scoring, and for the development of prediction tools or analysis platforms for prognosis. These tools are often called prognostic models, predictive models, or risk scores [9,10,11,12,13,14,15,16]. Increasing the accuracy of these prognostic models or risk scores can help patients make medical treatment decisions and can provide more reliable survival analyses. In the postgenomic era, the significant features are not limited to clinical information; the gene expression profiles of patients are also a crucial factor affecting prognosis [17,18,19].
To analyze gene expression, protein-coding RNAs (mRNAs) and noncoding RNAs, including long noncoding RNAs (lncRNAs), snRNAs, rRNAs, tRNAs, and microRNAs (miRNAs), have been considered as candidates [20,21,22,23]. With the launch of the Human Genome Project [24] and the advancement of next-generation sequencing technologies, more high-throughput RNA-seq data from cancer patients have become available for bioinformatics analyses [25]. However, the analysis of such large datasets was previously often limited by hardware capabilities [26]. With advances in hardware and the development of deep learning architectures, more studies have applied deep learning from the information domain to bioinformatics [27], aiming to learn and extract features from gene or RNA-seq data to train and build models [28,29,30]. Compared with the complexity and diversity of genomic features, however, the number of cancer patients for whom RNA-seq data are available is limited. When the number of features exceeds the number of samples, model overfitting tends to occur, which reduces prediction accuracy on test data [31]. In addition, the limited availability of clinical data also affects the effectiveness of deep learning. Hospitals' inability to actively track patients leads to loss to follow-up and censored death times for some patients. This incomplete clinical information may be the main limitation of cancer prognosis prediction [32]. For example, in the TCGA-BRCA database, the most commonly recorded event date is the last follow-up date, not the patient's date of death. This may be the key factor affecting the accuracy of previous studies. Therefore, we excluded such records to improve the accuracy of the prediction model and then used dimension raising and age stratification strategies to build SaBrcada, a deep learning survival analysis model for breast cancer patients.
First, we downloaded the RNA-seq and clinical data from the TCGA-BRCA database and conducted data screening. TCGA-BRCA provides the RNA-seq data in fragments per kilobase per million (FPKM); FPKM is applicable to paired-end RNA-seq experiments only. As third-generation sequencing technologies such as single-molecule real-time (SMRT) sequencing and Oxford Nanopore sequencing have developed, a normalization method that is widely applicable across sequencing platforms is needed for survival analysis model construction. Transcripts per million (TPM) represents the relative expression level of a transcript, and the TPM values within each sample sum to one million. In principle, TPM should therefore be comparable between samples; thus, we converted the gene expression data from FPKM to TPM. Considering the correlation among gene expression levels, deep learning was selected for model construction. To process the data for convolutional neural network (CNN) learning, we used a dimension raising strategy to reshape the gene expression data into a matrix and then subtracted the data in pairs to generate a differential gene expression image (survival analysis image). We developed survival analysis models using CNNs with 8 different architectures. Among them, GoogLeNet exhibited the best performance. Patient age has also been reported to be an important feature that affects survival time [3]. To test the effectiveness of the age stratification strategy, the data of breast cancer patients were grouped based on quantiles of age. The results showed that age stratification at 61 years had the best performance, which is in agreement with the median age at breast cancer diagnosis reported by the American Cancer Society [33]. For clinicians' reference, we also established a free website tool, SaBrcada (http://ncblab.nchu.edu.tw/SaBrcada), which provides 5 predicted survival intervals: within half a year, between half a year and one year, between one and three years, between three and five years, and more than five years.
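The FPKM-to-TPM conversion mentioned above can be illustrated with a minimal sketch. It assumes the standard per-sample rescaling, TPM_i = FPKM_i / sum_j(FPKM_j) × 10^6; the function name `fpkm_to_tpm` and the three-gene sample are hypothetical and are not taken from the SaBrcada code.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million.

    Assumed formula: TPM_i = FPKM_i / sum_j(FPKM_j) * 1e6
    """
    total = sum(fpkm)
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return [value / total * 1_000_000 for value in fpkm]


# Hypothetical three-gene sample, for illustration only.
sample_fpkm = [10.0, 30.0, 60.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)
```

Because every sample is rescaled to the same total, the relative ordering of genes within a sample is unchanged; only the scale differs.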
Figure 1.
SaBrcada modeling process. RNA-seq and clinical data of breast cancer patients downloaded from TCGA-BRCA were first filtered to exclude records with incomplete RNA-seq expression data or missing clinical data, death dates, or age information. After the RNA-seq data were converted into TPM format, they were split into two subsets at the age of 61, and 70% of the data in each subset were used for training. Through dimension raising, survival analysis images were generated and used for deep learning modeling. Finally, the remaining 30% of the data were used as test data to verify the accuracy of the model.
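The age split and the 70%/30% partition described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `split_by_age`, the record layout, the fixed random seed, and the choice to place patients aged exactly 61 in the older subset are all hypothetical.

```python
import random


def split_by_age(patients, cutoff=61, train_fraction=0.7, seed=0):
    """Split patients into two age subsets, then take 70% of each for training."""
    rng = random.Random(seed)
    # Assumption: the cut-off age itself belongs to the older subset.
    younger = [p for p in patients if p["age"] < cutoff]
    older = [p for p in patients if p["age"] >= cutoff]

    train, test = [], []
    for subset in (younger, older):
        shuffled = subset[:]
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return train, test


# Hypothetical cohort: 40 patients aged 40 to 79.
cohort = [{"id": i, "age": 40 + i} for i in range(40)]
train_set, test_set = split_by_age(cohort)
```

Sampling within each age subset keeps the proportion of younger and older patients roughly equal in the training and test data.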
Figure 2.
Data screening flowchart. The flowchart details how many samples were removed at each stage and why. From TCGA, 1187 samples were downloaded to construct the SaBrcada-BPP dataset (before preprocessing). After excluding 96 samples lacking clinical data and a further 284 samples with the same survival time as other samples, we built the SaBrcada-APP dataset (after preprocessing), which contains 807 breast cancer cases. Finally, 663 samples without a death date were removed, leaving 144 samples with an actual death date to build the SaBrcada-AD dataset.
Figure 3.
Survival analysis data generation. (a) The survival analysis data generation method. The positive data type was generated by subtracting the TPM data of patients with shorter survival times from those of patients with longer survival times; the negative data type was generated by subtracting the TPM data of patients with longer survival times from those of patients with shorter survival times. (b) Schematic diagram of a survival analysis data example. N1 and N2 indicate the gene expression of patients N1 and N2 in TPM format, respectively. Data Type is the survival analysis data generated by subtracting the TPM data of patient N2 from those of N1.
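The pairwise subtraction in this caption can be sketched in a few lines. The function name `make_survival_pairs`, the tuple layout, and the decision to emit both a positive and a negative sample for every pair (and to skip pairs with equal survival times) are assumptions made for illustration.

```python
from itertools import combinations


def make_survival_pairs(patients):
    """Build labelled difference vectors from (survival_days, tpm_vector) tuples."""
    samples = []
    for (days_a, tpm_a), (days_b, tpm_b) in combinations(patients, 2):
        if days_a == days_b:
            continue  # assumption: ties carry no ordering, so they are skipped
        if days_a > days_b:
            longer, shorter = tpm_a, tpm_b
        else:
            longer, shorter = tpm_b, tpm_a
        # Positive: longer-surviving patient minus shorter-surviving patient.
        samples.append(([l - s for l, s in zip(longer, shorter)], "positive"))
        # Negative: shorter-surviving patient minus longer-surviving patient.
        samples.append(([s - l for l, s in zip(longer, shorter)], "negative"))
    return samples


# Hypothetical two-gene TPM vectors for three patients.
toy_patients = [(100, [5.0, 1.0]), (400, [2.0, 3.0]), (250, [4.0, 4.0])]
toy_samples = make_survival_pairs(toy_patients)
```

Pairing turns n patients into up to n(n-1) labelled samples, which is one way to enlarge a small cohort for training.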
Figure 4.
Schematic diagram of survival analysis images. By dimension raising and scaling the survival analysis data in the range from 0 to 255, a survival analysis matrix was generated for further survival analysis image conversion.
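A minimal sketch of the dimension raising and 0 to 255 scaling named in this caption is given below. The source does not state the matrix shape or the scaling rule, so the square layout, zero padding, min-max scaling, and the function name `to_image_matrix` are assumptions.

```python
import math


def to_image_matrix(diff):
    """Scale a difference vector to 0..255 and reshape it into a square matrix."""
    if not diff:
        raise ValueError("empty difference vector")
    lo, hi = min(diff), max(diff)
    span = hi - lo
    if span == 0:
        scaled = [0] * len(diff)
    else:
        # Assumed min-max scaling into the 8-bit grayscale range.
        scaled = [round((v - lo) / span * 255) for v in diff]

    # Smallest square that holds every value: side = ceil(sqrt(n)).
    side = math.isqrt(len(scaled) - 1) + 1
    padded = scaled + [0] * (side * side - len(scaled))  # assumed zero padding
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Hypothetical five-gene difference vector.
toy_matrix = to_image_matrix([-3.0, 2.0, 0.0, 1.0, -1.0])
```

Each row of the result can be written out as one row of grayscale pixels to obtain the image passed to the CNN.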
Figure 5.
Pixel distribution diagram after image generation. (a) type image; (b) pixel value distribution of type image; (c) type image; (d) pixel value distribution of type image.
Figure 6.
Performance of stratified random sampling by age. The X axis is the age cut-off, and the Y axis is the accuracy.
Figure 7.
The prediction accuracy for breast cancer patients using SaBrcada by age. The X-axis is the age of the patient, and the Y-axis is the accuracy. The red dots indicate that the accuracy is greater than 0.7.
Figure 8.
SaBrcada website tool interface. The tool is freely available at http://ncblab.nchu.edu.tw/SaBrcada. It provides a tool for generating survival analysis images and online analysis of survival time. The outcome of the analysis is the patient's predicted survival time, which can be classified as less than six months, six months to one year, one to three years, three to five years, or more than five years.
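The five output classes can be expressed as a simple lookup over survival time in days. The day thresholds (a 365-day year), the handling of values that fall exactly on a boundary, and the function name `survival_interval` are assumptions; the website's own boundaries are not given in the text.

```python
from bisect import bisect_right

# Assumed class boundaries in days, based on a 365-day year.
BOUNDS_DAYS = [182.5, 365, 3 * 365, 5 * 365]
LABELS = [
    "less than six months",
    "six months to one year",
    "one to three years",
    "three to five years",
    "more than five years",
]


def survival_interval(days):
    """Map a survival time in days to one of the five interval labels."""
    if days < 0:
        raise ValueError("survival time cannot be negative")
    # A value exactly on a boundary is assigned to the longer interval.
    return LABELS[bisect_right(BOUNDS_DAYS, days)]
```

For example, `survival_interval(1163)` falls between 1095 and 1825 days and is therefore labelled "three to five years".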
Table 1.
List of the datasets used in this study.
Dataset | No. | Age at index, median (range) | Survival days, median (range) | Race, no. (%) (W, BAA, A, AIAN, NR) *
SaBrcada-BPP a | 1187 | 58 (26, 90) | 912 (-7, 8605) | 753 (68%), 182 (16%), 61 (5%), 1 (0.09%), 94 (9%)
SaBrcada-APP b | 807 | 57 (26, 90) | 1026 (0, 8605) | 583 (72%), 141 (17%), 34 (4%), 1 (0.1%), 48 (5%)
SaBrcada-AD c | 144 | 58 (25, 90) | 1163 (0, 7455) | 106 (74%), 30 (21%), 2 (1%), 0 (0%), 6 (4%)
SaBrcada-AYT61 d | 84 | 46 (25, 58) | 1439 (227, 7455) | 51 (74%), 15 (22%), 1 (1%), 0 (0%), 2 (3%)
SaBrcada-AOT61 e | 60 | 69 (54, 90) | 1004 (0, 4267) | 55 (73%), 15 (18%), 1 (3%), 0 (0%), 4 (5%)
SaBrcada-train f | 103 | 58 (25, 90) | 1032 (0, 7455) | 77 (74%), 19 (18%), 2 (2%), 0 (0%), 5 (7%)
SaBrcada-test g | 41 | 58 (27, 85) | 1692 (158, 3926) | 29 (71%), 11 (27%), 0 (0%), 0 (0%), 1 (2%)
Table 2.
Comparison of different convolutional neural network architectures.
Architecture | Accuracy | Batch Size | Epochs
Resnet18 | 0.50 | 8 | 50
Resnet18 | 0.49 | 16 | 100
Resnet18 | 0.50 | 32 | 150
Resnet50 | 0.50 | 8 | 50
Resnet50 | 0.50 | 16 | 100
Resnet50 | 0.50 | 32 | 150
Resnet101 | 0.50 | 8 | 50
Resnet101 | 0.50 | 16 | 100
Resnet101 | 0.50 | 32 | 150
Resnet152 | 0.50 | 8 | 50
Resnet152 | 0.49 | 16 | 100
Resnet152 | 0.50 | 32 | 150
ResNext101 | 0.50 | 8 | 50
ResNext101 | 0.50 | 16 | 100
ResNext101 | 0.50 | 32 | 150
GoogLeNet* | 0.55 | 8 | 50
GoogLeNet* | 0.50 | 16 | 100
GoogLeNet* | 0.60 | 32 | 150
DenseNet121 | 0.55 | 8 | 50
DenseNet121 | 0.54 | 16 | 100
DenseNet121 | 0.54 | 32 | 150
DenseNet161 | 0.55 | 8 | 50
DenseNet161 | 0.55 | 16 | 100
DenseNet161 | 0.53 | 32 | 150
Table 3.
Comparison of SaBrcada with other breast cancer survival analyses.
Model | Number of Cancer Types | Type of Data | Patient Number | Method | C-index* / Accuracy†
SaBrcada-APP-M | 1 a | mRNA | 807 c | GoogLeNet | 0.500†
SaBrcada-AD-M | 1 a | mRNA | 144 c | GoogLeNet | 0.600†
SaBrcada-ASYT61-M | 1 a | mRNA | 84 c | GoogLeNet | 0.500†
SaBrcada-ASOT61-M | 1 a | mRNA | 60 c | GoogLeNet | 0.681†
SaBrcada | 1 a | mRNA | 144 c | GoogLeNet | 0.798†
VAECox (2019) | 10 b | mRNA | 6127 d | VAE, Cox | 0.649*
SALMON (2020) | 1 a | mRNA, miRNA–target interactions | 626 c | Cox | 0.700*
ConcatAE (2020) | 1 a | DNA methylation, miRNA | 1060 e | ConcatAE | 0.641*
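Table 3 reports a C-index for the comparison models and an accuracy for the SaBrcada models, so the two columns are not measured on the same scale. As a reminder of what the C-index measures, the sketch below computes a concordance index for fully observed (uncensored) survival times; it is a simplified illustration and not the implementation used by any of the cited tools. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def concordance_index(times, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by risk score.

    Assumes every survival time is an observed death (no censoring).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied survival times are not comparable
        comparable += 1
        shorter, longer = (i, j) if times[i] < times[j] else (j, i)
        if risk_scores[shorter] > risk_scores[longer]:
            concordant += 1.0  # higher risk died sooner: correct ordering
        elif risk_scores[shorter] == risk_scores[longer]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical survival times (days) and predicted risk scores.
toy_c_index = concordance_index([100, 400, 250, 900], [0.9, 0.4, 0.3, 0.1])
```

In this toy example five of the six comparable pairs are ordered correctly, so the index is 5/6. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.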