1. Introduction
Prostate cancer is the second most common cancer and the fifth leading cause of cancer death among men, with almost 1.3 million new cases and 359,000 associated deaths in 2018 worldwide [
1]. Genetic instability is one of the hallmarks of cancer cells. This is happened both by single point mutations or at the chromosomal abnormality. However, a few of them, called drivers, contribute to oncogenesis, whereas the majority are passenger mutations accumulated during cancer progression. Systematic identification of driver genes from large background noise is important. In this study, we identify putative genomic variants associated with the increased risk of cancer susceptibility from large background noises to provide an appropriate list of genes with potential impact on PC progression.
Identification of cancer-associated genomic variants has been focused on both protein-coding and non-coding genes. For example, Functional Analysis through Hidden Markov Models (FATHMM) [
2] was used to prioritize genomic variants in the protein-coding genes. However, most of the genome is in non-coding regions including non-coding RNAs and non-annotated regions and the majority (> 90%) of genomic variants occur in these regions [
3] Thus, determining the effect of genomic variants in non-coding regions is necessary. To this aim, there are computational tools that link genomic variants to different regulatory elements obtained from international projects such as ENCyclopedia of DNA Elements (ENCODE), Functional Annotation of the Mammalian Genome (FANTOM), Roadmap Epigenomics Project, and Genotype-Tissue Expression (GTEX). For instance, FunSeq2 [
4] was designed to identify and prioritize non-coding somatic point mutations using various resources including ENCODE and other publications [
5]. This pipeline firstly assigns a score to genomic variants based on overlapping these genomic variants with different genomic features including regulatory elements (enhancer marks H3K4me1 and H3K27ac, DNA methylation), network of genomic variants associated with genes, and recurrent elements across cancer samples (i.e., those variants identified on whole genome sequencing of at least two samples). Then, FunSeq2 assigns specific weight to features based on the 1-Shannon entropy. RegulomeDB is another tool [
6] that was designed to prioritize disease-associated Single Nucleotide Polymorphisms (SNPs). This method employs a heuristic scoring system that assign a specific score for each SNPs based on the number of overlapping between SNPs and integrated regulatory database including TFBSs, chromatin states of different cell types and eQTL data. Chen et al. [
7] also developed an enrichment analysis to test whether any risk-associated SNPs are located in the functional genomic regions including UCSC annotated coding regions (exon and snoRNA/miRNA) and regulatory regions, as well as binding regions for transcription factors (TFs), histone modifications (HMs), DNase I hypersensitivity (DHSs), and RNA Polymerase IIA (POLR2A) more than expected. RegulemeDB, HAploReg and Variant Effect Predictor (VEP) toolsets also map GWAS SNPs to regulatory elements to identify functional GWAS variants [
8].
There is another category of methods using machine learning techniques to predict the potential impact of genomic variants. These methods are supervised methods, which have been trained using functional annotations to determine pathogenic variants. and then, new genomic variants can be classified using this information. For example, DeepSEA (deep learning based sequence analyzer) [
9] uses a convolutional neutral network (CNN) based framework to predict the effect of chromatin factors (transcription factor binding, DNase I sensitivity, histone mark profile) in genomic sequence. In the prioritization part, DeepSEA predicts regulatory mutations using boosted logistic classifiers via eQTL data, non-coding trait-associated SNPs identified in GWAS studies from the US National Human Genome Research Institute’s GWAS Catalog. Chengliang et al. also presented iCAGES (integrated CAncer GEnome Score) [
10], a statistical framework that prioritizes cancer driver mutations, genes, and targeted drugs. This method firstly integrates different prioritization tools (FunSeq2, SIFT, FATHMM, VEST, Mutation Taster, Phylop, PolyPhen2, GERP++, Mutation Assessor, LRT, SiPhy, and LRT) to determine specific score into protein-coding mutations, non-coding mutations, and structural variations. Then, in the second layer, iCAGES takes the relevant genes from the last step and gene list from Phenolyzer tool to provide the score to each gene based on a logistic regression model. Lastly, this method links identified genes to specific drugs and calculates specific score for each drug, based on its effectiveness. Shengcheng et al. also presented SURF (Score of Unified Regulatory Features) [
11] that uses features from RegulomeDB and DeepSEA tools and then apply a random forest model to predict the effect of genomic variant (SNP) in promoters and enhancers regions.
The above-mentioned methods determine the overlapping of genomic variants in coding and non-coding regions, however, they are not able to identify the potential impact of the variants and how these variants affect gene expression. Integrative analyses have been used previously in cancer biomarkers discovery [
12], however, none of these platforms integrate chromosome confirmation capture data to identify the impact of regulatory variants in PC. Here, we develop a new integrative pipeline, Associated Genomic Variants (AGV), that uses high-throughput chromosome confirmation capture data (Hi-C), RNA-Seq, ChIP-Seq, and a list of genetic variants to link the variants to target genes in prostate cancer. We applied AGV on genomic variants of 194 PC patients obtained from International Cancer Genome Consortium (ICGC) and PC-associated GWAS SNPs from GWAS Catalog and identified the candidate coding and non-coding variants and their associated target genes.
To do this, AGV first identifies PC-associated somatic point mutations hotspot regions and CNV regions (genomic regions that CNVs are overlapping – CNVRs) and coding and non-coding genes affected by these variants. AGV then uses H3K27ac ChIP-Seq marks to identify variants that happened in the enhancer regions. Using Hi-C interactions from normal and cancer cell lines, AGV generates a list of genetic variants with potential regulatory function. Lastly, we validated the PC-associated variant identified in this study using an independent whole genome sequencing data from the same PC cell line. An overview of AGV pipeline is provided in
Figure 1.
The main novelties and contributions of our work are as follows:
This is the first study that comprehensively considers GWAS SNPs, somatic point mutations and CNVs, while the previous methods only considered somatic mutations and GWAS SNPs to identify functional cancer-associated variants.
In comparison to other studies [
2] which mainly considered genomic variants in protein-coding genes, in this study, we performed the analysis on both coding and non-coding regions.
Most of methods that determine associated genomic variants in non-coding regions such as FunSeq2 [
4], DeepSEA [
9], RegulomeDB [
6] and SURF [
11] are developed for general diseases and may not work well for a specific cancer.
We used an innovative strategy to identify hotspot somatic point mutation regions, which can also be used in the further studies for identifying hotspot regions in cancer. The proposed method is built upon window analysis for detection of hotspot somatic mutation regions, which is an effective strategy for identification of hotspot regions, whereas other methods such as FunSeq2 [
2] and iCAGES [
10] were not to report highly mutable regions.
4. Materials and Methods
GWAS dataset
GWAS SNPs were downloaded from GWAS Catalog (
https://www.ebi.ac. uk/gwas/docs/file-downloads) and GWASdb v2 (
http://jjwanglab.org/gwasdb). We only considered those SNPs that were associated with Prostate cancer. All GWAS SNPs with
were excluded from the analysis.
Somatic point mutations dataset
The genomic coordinates of Somatic point mutation (SPM) for prostate cancer were obtained from International Cancer Genome Consortium (ICGC) [
40]. Totally, there were 10,154,740 SPMs from 1,037 PC patients across six projects (PRAD-US, PRAD-CA, PRAD-UK, EOPC-DE, PRAD-CN and PRAD-FR) from United States, United Kingdom, Canada, Germany, China and France.
Identification of somatic point mutation hotspots
To identify somatic point mutation hotspots, our pipeline firstly counts the mutation recurrence for fixed binned size regions (bin length=21bp). The user can set the window length based on the desired minimum recurrence frequency. The P-value of mutation recurrence is computed using a Poisson binomial distribution model to determine the significance of observing k samples containing somatic mutations in a 21bp window. Skewness-kurtosis graph and CDF plot were executed by “fitdistrplus” in R package. In the next step, the problematic hotspot regions, such as masked regions (regions with mappability score < 1 in the ENCODE 75mers alignability track in the UCSC genome browser) and Repetitive regions (RepeatMasker track and simpleRepeat tracks in the UCSC Genome browser) [
41] were excluded. We also excluded chromosome Y in our analysis.
PeakCNV
To determine CNV regions (genomic region that CNVs are overlapping - CNVRs) that are associated to disease, we proposed an AI-based method calling PeakCNV, which is an extension of SNATCNV toolset [
42].
PeakCNV selects CNVRs with the lowest confounding with true positive CNVRs. To this aim, PeakCNV has three main steps: including CNVR map building, clustering process and selection process. In the first step, it builds deletion and duplication CNVR maps for case and control, independently, then, it selects CNVRs that are significantly represented in cases over controls at nucleotide base. In the next step, it groups significant CNVRs into different clusters based on the similar association of CNVRs with the phenotype of interest. To this aim, we used DBSCAN clustering algorithm with two input features including, CNVR uniqueness (the number of case samples covered by a given CNVR after subtracting the common case samples between each pair of CNVRs) and the genomic distance between CNVRs. Lastly, it selects the most independent CNVRs from each cluster using a novel score IR-score. Independent CNVRs are those detected in the greatest number of cases and having a minimum co-occurrence with other CNVRs. PeakCNV runs with the default parameters (P-value <0.05).
Reference gene annotations
FANTOM5 [
43], Ensemble [
44] and GENCODE [
45] gene annotation files were used to curate a comprehensive reference gene list. The FANTOM5 gene annotation file was used as the backbone of our reference gene list, but when the gene annotation was absent from FANTOM5 these were acquired from Ensembl and GENCODE. The final reference gene list contained 82,539 genes including 58,000, 24,501 and 38 genes from FANTOM5, Ensembl and GENCODE, respectively. The genomic coordinates for CNVRs, somatic point mutations, GWAS SNPs and gene annotations were in hg19 genome assembly.
Identification of genomic variants affecting coding and non-coding genes
This analysis is performed to indicate which genes are affected by the observed genomic variants in prostate cancer including GWAS SNP (1,992 SNPs), hotspot regions (71 regions), and CNVRs (duplication: 216 CNVRs, deletion: 75 CNVRs). Bedtools v2.30.0 [
46] was used to identify the overlapping between the genomic coordinates of genomic variants and genes [
46]. The risk SNPs, hotspot regions, and CNVRs were used for this analysis are provided in
Supplementary Table 1. The list of genes affected by different types of genomic variants is also provided in
Supplementary Table 2.
Preparation of Hi-C libraries
Hi-C data from Normal human prostate epithelia cells (PrEC) prostate cancer cell lines PC3 and LNCaP with GEO GSE73785 were downloaded using KARAJ toolset [
47] from previously published data [
48]. We used KARAJ [
47] to download datasets and the
supplementary files. Two replicates were available for each cell line. We used HiC-Pro v2.11 [
49] with the default parameters for analyzing and aligning Hi-C data in 5kb fragment size. Hg19 genome building was uses form mapping. We then used MaxHiC [
20] and MHiC identify statistically significant cis-interactions. Here, we only considered those significant cis-interactions with P-value < 0.01, read-count >= 10, and distance between two sides of interaction more 5k and less than 10M. We then used our genes list to annotate Hi-C interactions with coding and non-coding genes. At least 10% overlap between gene and Hi-C fragments been considered to annotate Hi-C fragments with genes. Two replicates of each Hi-C cell line were merged to enhance the statistical power (
Supplementary Table 3).
Identification of H3K27ac ChIP-Seq peak regions
H3K27ac ChIP-Seq fastq files PC3, LNCaP and PrEC cell lines were downloaded from GEO GSE57498, GSE73785, GSE57498, respectively [
48,
50]. Bowtie2 [
51] were then used to map the fastq file to hg19 human genome reference. Peaks were then called using Model-based Analysis of ChIP-Seq (MACS2) [
52] with the
(
Supplementary Table 9).
Literature search strategy
Our literature searches were focused on human and mice English language papers available in the PubMed, Scopus, and Web of Science. We also used data and text mining techniques to extract additional related studies [
53,
54,
55,
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68]. A knowledge-based filtering system technique has been also used to categorize the texts from the literatures search [
69,
70,
71,
72,
73,
74]. The search terms included "Cancer", "Prostate cancer", "noncoding RNA", "enhancer", "CNV", "mutation", "copy number variations".
Whole genome sequencing data processing
-
a.
Mapping of fastq reads of prostate cell lines to reference genome
We obtained WGS data of LNCaP (ATCC CRL-1740) and PC3 (ATCC CRL-1435) from published work [
75] with Karaj pipeline. The quality checking of fastq files were performed using FastQC v0.11.9 [
76]. Trimmomatic v0.40 [
77] was then used to filter poor quality reads and trim poor quality bases (phred score < 30) from our samples. BWA-MEM v0.7.17 (r1188) [
78] was then used to map sequencing reads to the human reference genome (hg19) and a sorted BAM file was generated by SAMtools v1.12 [
79].
-
b.
Variant calling
To call single nucleotide polymorphisms (SNP) and short indels from the bam files, SAMtools v1.12 mpileup and bcftools [
80] were used to interrogate indexed BAM files of reads aligned to the reference genome and generate a VCF (Variant Call Format) file of SNP and short indel variants. Variant files (VCF) were next filtered using bcftools with the following parameters: QUAL <=30 && DP <=10; where QUAL denotes minimum variance confidence and DP total depth threshold. Control-FREEC v11.6 pipeline [
81] were also used to call copy number variations from the sorted BAM files and generate duplicated and deleted variants.
Data visualization
To visualize the impact of regulatory variants in Hi-C interaction and gene expression, Washu Epigenome Browser [
82] was used. In this analysis, the Hi-C interactions in conjunction with gene expression, ChIP-Seq and genomic variants data was used.
Pathway analysis
To validate the capability of AGV in identifying meaningful genes, we used ShinyGO v0.4 [
27]. It contains 72,394 gene-sets for human genome including KEGG [
83], MSigDB [
84], GeneSetDB [
85], REACTOME [
86], etc. It has also access to STRING-db [
87] for retrieving protein–protein interaction networks. we analyzed a set of 131 genes (
Supplementary Table 6) which identified as the potential regulatory genes in our analysis. These genes list is mapped to the all collection of human gene-sets in ShinyGo for enrichment analysis. ShinyGo uses a hypergeometric distribution over-representation test to calculate the p value for gene set overlaps. We run ShinyGo with the default values (
) (
Supplementary Table 8).
Figure 1.
An overview of AGV. AGV pipeline first makes a list of associated genomic variants including GWAS SNPs, somatic point mutation and CNV regions. AGV then uses Hi-C and H3K27ac to determine variants with possible rules in disrupting promoter-enhancer interactions. Lastly, AGV reports a list of functional genomic variants with possible role in PC.
Figure 1.
An overview of AGV. AGV pipeline first makes a list of associated genomic variants including GWAS SNPs, somatic point mutation and CNV regions. AGV then uses Hi-C and H3K27ac to determine variants with possible rules in disrupting promoter-enhancer interactions. Lastly, AGV reports a list of functional genomic variants with possible role in PC.
Figure 2.
The schematic workflow used in this study to identify somatic point mutation hotspot regions. This analysis consists of three main steps: (1) window analysis, (2) selection, and (3) filtering. In the first step, the tool divides the genome into 21bp bins and then counts the number of samples with at least one SPM that overlapped with the window. In the selection step, a Poisson binomial distribution was used to select significant bins (P-value < 0.05). Lastly, in the filtering step, the problematic hotspot regions and chromosome Y was excluded from the final list of hotspot regions.
Figure 2.
The schematic workflow used in this study to identify somatic point mutation hotspot regions. This analysis consists of three main steps: (1) window analysis, (2) selection, and (3) filtering. In the first step, the tool divides the genome into 21bp bins and then counts the number of samples with at least one SPM that overlapped with the window. In the selection step, a Poisson binomial distribution was used to select significant bins (P-value < 0.05). Lastly, in the filtering step, the problematic hotspot regions and chromosome Y was excluded from the final list of hotspot regions.
Figure 3.
A) Distribution of SPMs for different window sizes (500bp, 50, 21, and 9bp) on chromosome 22. B) Probability distribution for identified 21 bp bins by Cullen and Frey graph, and CDF plot.
Figure 3.
A) Distribution of SPMs for different window sizes (500bp, 50, 21, and 9bp) on chromosome 22. B) Probability distribution for identified 21 bp bins by Cullen and Frey graph, and CDF plot.
Figure 4.
A genome-wide overview of somatic mutation hotspots (red dots) and filtered regions (containing masked regions and repetitive regions (blue dots) and non-significant regions (black dots)) identified in this study. The figure illustrates the distribution of 21bp bin-size windows encompassing PC related somatic point mutations across the genome. x-axis shows the window number and y-axis shows the number of case samples covered by the window. For each window, our proposed method calculates the P-value of mutation recurrence using the Poisson binomial distribution Then, problematic regions including masked regions and repetitive regions were excluded and bins with P-value < 0.001 were selected.
Figure 4.
A genome-wide overview of somatic mutation hotspots (red dots) and filtered regions (containing masked regions and repetitive regions (blue dots) and non-significant regions (black dots)) identified in this study. The figure illustrates the distribution of 21bp bin-size windows encompassing PC related somatic point mutations across the genome. x-axis shows the window number and y-axis shows the number of case samples covered by the window. For each window, our proposed method calculates the P-value of mutation recurrence using the Poisson binomial distribution Then, problematic regions including masked regions and repetitive regions were excluded and bins with P-value < 0.001 were selected.
Figure 5.
Linking Genomic variants to coding and non-coding genes. A) Somatic point mutation hotspots; B) GWAS SNPs; C) Duplicated CNVRs; D) Deleted CNVRs. The left panel shows the percentage of genomic variants associated to the genes (Annotated regions) or there are not genes relating to the genomic variants (non-annotated regions). The right panel shows the fraction of these genomic variants are protein coding or non-coding genes. The greater fraction of hotspot regions and CNVRs are located in non-coding genes, while less than 30% of GWAS SNPs are located in non-coding genes. E-I show the percentage of linking different types of genomic variants including E) Somatic point mutation hotspots; F) GWAS SNPs; G) Duplicated CNVRs; H) Deleted CNVRs into noncoding RNA. y-axis represents the number of different types of RNA associating to genomic variants.
Figure 5.
Linking Genomic variants to coding and non-coding genes. A) Somatic point mutation hotspots; B) GWAS SNPs; C) Duplicated CNVRs; D) Deleted CNVRs. The left panel shows the percentage of genomic variants associated to the genes (Annotated regions) or there are not genes relating to the genomic variants (non-annotated regions). The right panel shows the fraction of these genomic variants are protein coding or non-coding genes. The greater fraction of hotspot regions and CNVRs are located in non-coding genes, while less than 30% of GWAS SNPs are located in non-coding genes. E-I show the percentage of linking different types of genomic variants including E) Somatic point mutation hotspots; F) GWAS SNPs; G) Duplicated CNVRs; H) Deleted CNVRs into noncoding RNA. y-axis represents the number of different types of RNA associating to genomic variants.
Figure 6.
The number and distance of statistically significant Hi-C interactions in cancer cell lines (PC3 and LNCaP) and healthy cell line (PrEC).
Figure 6.
The number and distance of statistically significant Hi-C interactions in cancer cell lines (PC3 and LNCaP) and healthy cell line (PrEC).
Figure 7.
A) Example of a regulatory variant in cancer cell line PC3. The figure demonstrates the RNA-Seq, H3K27ac signals, and Hi-C chromatin interactions map of normal (PREC) and prostate cancer (PC3) cells on Chromosome 8. The highlighted box shows one of the PC-associated CNVR identified in this study that was also observed in PC3 WGS. There is an EPI in cancer cell line (the EPI was not observed in the healthy cell line) that the enhancer side of the interaction overlapped with CNVR. Interestingly. The left side of this interaction is promoter regions of RNF139 and TADN1, and the right side (enhancer region) has also an active H3K27ac signal. The expression of NDUFB9 is much higher in cancer cell compared to healthy cell line. B) Example of regulatory variant in cancer cell line LNCaP. The figure demonstrates the RNA-seq, H3K27ac signals, and Hi-C chromatin interactions map of normal (PREC) and prostate cancer (LNCaP) cells on Chromosome 2. The highlighted box shows one of the PC-associated CNVRs identified in this study that was also observed in LNCaP WGS. There is an EPI in cancer cell line (the EPI was not observed in the healthy cell line) that the enhancer side of the interaction overlapped with CNVR. Interestingly. The left side of this interaction is promoter regions NAT8 and ALMS1P genes, and the right side (enhancer region) has also an active H3K27ac signal. WashU Epigenome Browser has been used to generate the figure.
Figure 7.
A) Example of a regulatory variant in cancer cell line PC3. The figure demonstrates the RNA-Seq, H3K27ac signals, and Hi-C chromatin interactions map of normal (PREC) and prostate cancer (PC3) cells on Chromosome 8. The highlighted box shows one of the PC-associated CNVR identified in this study that was also observed in PC3 WGS. There is an EPI in cancer cell line (the EPI was not observed in the healthy cell line) that the enhancer side of the interaction overlapped with CNVR. Interestingly. The left side of this interaction is promoter regions of RNF139 and TADN1, and the right side (enhancer region) has also an active H3K27ac signal. The expression of NDUFB9 is much higher in cancer cell compared to healthy cell line. B) Example of regulatory variant in cancer cell line LNCaP. The figure demonstrates the RNA-seq, H3K27ac signals, and Hi-C chromatin interactions map of normal (PREC) and prostate cancer (LNCaP) cells on Chromosome 2. The highlighted box shows one of the PC-associated CNVRs identified in this study that was also observed in LNCaP WGS. There is an EPI in cancer cell line (the EPI was not observed in the healthy cell line) that the enhancer side of the interaction overlapped with CNVR. Interestingly. The left side of this interaction is promoter regions NAT8 and ALMS1P genes, and the right side (enhancer region) has also an active H3K27ac signal. WashU Epigenome Browser has been used to generate the figure.
Figure 8.
Enriched proteins in the KEGG pathway for cancer.
Figure 8.
Enriched proteins in the KEGG pathway for cancer.