1. Introduction
Pancreatic ductal adenocarcinoma (PDAC) is the most common neoplastic disease of the pancreas. It is classified as one of the most chemoresistant types of cancer due to its broad intra-tumoral genetic and functional heterogeneity and characteristic dense stromal environment [
1,
2]. Despite extensive work toward improving diagnostic techniques, surgical procedures, and chemotherapy, the prognosis of PDAC patients is still poor with a 5-year overall survival (OS) rate of ∼13% [
3], indicating an urgent need for the development of effective interventions or new targeted therapies.
Bulk genomic analysis and expression profiling of clinical specimens have shaped much of our understanding of PDAC in patients, including epigenetic changes, gene mutations and copy number alterations that facilitate clonal selection and eventually contribute to the malignant phenotype of the cancer cells [
4]. With the advent of single-cell RNA sequencing (scRNA-seq) it is possible to determine the transcriptome of individual cells, allowing the identification of cell populations and their functions and bringing into order the apparent chaos, which made it seem like every cell was very different from the others and therefore carried out its activity in a manner that was poorly coordinated with the other tumor cells.
In 2009, the first study using scRNA-seq was published [
5], and since then, numerous studies have been conducted by specialized laboratories with expertise in wet-lab single-cell genomics and bioinformatics. Moreover, the availability of various scRNA-seq platforms and the continuous improvement of bioinformatics methods have now made it possible for any biomedical researcher or clinician to use scRNA-seq and make new discoveries [
6].
In PDAC, scRNA-seq is being used for the dissection of heterogeneity in the cell populations within the tumor and promises to reveal correlations between the presence of cell populations and prognosis and guide better clinical decisions for personalized treatment. For example, scRNA-seq performed on cells of primary tissues from ten different patients with localized and metastatic PDAC [
7] showed distinct cell types including tumor cells, immune cells, cancer-associated fibroblasts (CAFs), and endothelial cells. Variations in gene expression patterns among patients were detected and it was possible to distinguish two populations of tumor cells. One population displayed EMT (epithelial-mesenchymal transition) characteristics and the other exhibited epithelial characteristics. A more aggressive form of the disease and a worse prognosis was linked to the presence of the EMT tumor cell population. In addition, the expression profile of immune cells (macrophages and T cells) from the different individuals showed high similarities, in both primary and metastatic tumors. Three major CAF clusters were also identified on the basis of gene expression, supporting the existence of 3 functional subtypes of CAFs: antigen-presenting CAFs (apCAFs), inflammatory CAFs (iCAFs), and myofibroblastic CAFs (myCAFs) in PDAC tumors [
8,
9]. A high expression of the signature genes associated with one of the CAF subtypes (myCAFs) negatively correlated with patient survival [
7].
It should be noted that by scRNA-seq, the transcriptomic profile of individual cells can be read, and the sequencing data subsequently processed using bioinformatic tools to identify the cell types [
10]. This identification procedure is neither trivial nor unique because on one hand the sequencing depth of scRNA-seq is not high hence we only know a part of the transcripts of each cell, and on the other hand the gene transcripts whose presence or absence accurately distinguish cell types are poorly known [
11]. Currently, the identification of malignant cells from scRNA-seq data is performed by marker-based methods and by inferring Copy Number Variations (CNVs) [
12,
13].
Tumor marker genes refer to genes that are predominantly expressed in a malignant tissue but lowly or not expressed in normal tissues, and marker-based methods aim to identify cancer cells or clusters based on the expression of these canonical cancer-specific marker genes [
14].
However, probably all the markers capable of identifying PDAC cells are not known yet and attempts to identify them are faced with various challenges. For instance, tumors are complex tissues surrounded by a heterogeneous cellular microenvironment in which the cancer cells interact [
15] and therefore it is important to analyse biopsies and not cell cultures. However, the RNA-seq technique has the intrinsic limitation of returning average data of all cell types present in the sample [
16]. To overcome this limitation, more advanced studies have dissected the biopsy, for example by immunohistochemical (IHC) assay, and selected the part with the highest concentration of tumor cells in an attempt to obtain more tumour-specific sequencing data [
17].
Furthermore, correct use of the known markers [
18,
19] to identify tumor cells is often not known because some of these markers only characterize tumor subpopulations and therefore do not necessarily need to be present in every cell. In addition, the weight to assign to the presence of each specific marker to reach an acceptance threshold is not known. Therefore, we still cannot rely solely on markers as an accurate method of identifying tumor cells [
20,
21].
In PDAC, for example, the identification of tumor cells often involves the use of epithelial markers (e.g., E-cadherin or EpCAM) as well as intracellular markers such as the lineage-specific cytokeratin subtypes. The epithelial cell adhesion molecule (EpCAM) protein is often expressed on normal epithelial cells and is overexpressed in a subset of tumor cells such as adenocarcinomas of the stomach, colon, prostate, and pancreas [
22,
23,
24]. However, these genes may not always be highly expressed by cancer cells and hence give rise to false negatives.
Copy number variations (CNVs) are the predominant form of variation resulting from the genomic rearrangements referred to as structural variation (SV). CNVs can lead to gene amplifications (copy number gain) and deletions (copy number loss) that could correspond to increases or decreases in the amount of RNAs and proteins encoded by the gene [
25]. CNVs have been described as a hallmark of all cancers, with several studies demonstrating that variations in the number of copies of specific genes (especially for oncogenes and tumor suppressor genes) play a role in tumor initiation, development, progression, drug sensitivity and resistance by affecting the expression levels of these genes [
26,
27,
28,
29]. The occurrence of CNVs in PDAC have also been studied, including those with pathogenic roles in the development and progression of the disease [
30,
31], and their detection leveraged to facilitate the identification of tumor cells from scRNA-seq experiments [
32]. This is accomplished through the various software available for inferring CNVs from scRNA-seq data such as sciCNV [
33], InferCNV, CopyKAT [
13] and SCEVAN [
34].
In the literature about scRNA-seq, tumor cells are identified using markers and then these results are confirmed using software that predicts the presence of CNVs. Therefore, it is assumed that the cells considered as cancerous are those identified both by biomarker methods and by the inference of CNVs.
In this work we analysed PDAC scRNA-seq data to verify the concordance of the predictions based on biomarkers and those based on inferring CNVs, and which of the CNV inference tools provides the best performance.
3. Results
3.1. Cell Composition of Primary and Metastatic Tumor Tissues and Normal Pancreas Controls
To select immune cells to use as reference cells for InferCNV and sciCNV analyses, we annotated scRNA-seq data from PDAC patients (n = 5), adjacent normal pancreas (n = 2) and normal (n=1) using SingleR tool. Various cell types were identified in each sample (
Table 3) including non-immune cells (e.g., epithelial cells, endothelial cells, fibroblasts, and tissue stem cells) and immune cells (e.g., macrophages, monocytes, T cells, NK cells, dendritic cells (DC), and B cells). A more detailed annotation of the cell types in each sample using SingleR is reported in
Supplementary Table S2.
3.2. PDAC Tumor Cell Identification Using a Marker-Based Method
To predict tumor cells based on the expression of PDAC tumor markers, we collected markers genes from an online literature search which resulted in the inclusion of a total of 21 published articles (
Supplementary Table S1) [
7,
40,
41,
42,
43,
44,
45,
46,
47,
48,
49,
50,
51,
52,
53,
54,
55,
56,
57,
58,
59]. The studies we retrieved had been performed on biopsies from treatment-naïve PDAC patients with enrichment procedures being reported in only two of the studies. Validation studies on cell lines and datasets from The Cancer Genome Atlas (TCGA), Genotype-Tissue Expression Project (GTEx) and Gene Expression Omnibus (GEO)-NCBI databases were recorded in most of the articles. The main techniques reported in the studies were microarray, RNA-sequencing and scRNA-seq, with validations using Quantitative real-time PCR (qRT-PCR) and Immunohistochemistry (IHC). We report 134 markers, some of which are more recently discovered (e.g., SYT8, VSIG2). Only two transcriptomic signatures reported were composed of co-expressed genes: the cancer stem cell markers CD44, CD24, ESA; and TSPAN8, SOX9. We do not exclude the possibility that there may be other relationships of co-presence or mutual exclusion among biomarkers to identify a cancer cell, but this information was not reported in the studies we considered.
After exclusion of the markers that were not expressed in any of our samples, we obtained a total of 91 marker genes reported to be highly enriched in PDAC. By the Trailmaker browser, we assessed the expression of these marker genes in the cell clusters of each sample in our study and pooled the cells from all the clusters predicted to be cancerous in each sample. As expected, no clusters with a tumor profile were found in the Normal_N1 sample taken from the healthy patient. However, clusters with tumor transcriptomic profiles were observed in the normal samples adjacent to the PDAC tissue (Supplementary
Figure S1a-h).
However, it is known that one common source of normal control samples in cancer research is the normal adjacent tissue (NAT) [
60]. Although NATs should not contain malignant tumor cells, they are frequently not histologically normal and exhibit significant inflammation and other alterations that could lead to their expression of tumor-markers [
61]. For instance, some researchers analysed the transcriptomes of tumor and NAT tissues across eight tumor types and healthy tissue of the same kind and observed that the NATs exhibited a distinct state that segregates as an intermediate between tumor and healthy tissues, possibly as a result of their inflammatory reaction to the tumor tissue [
60].
3.3. PDAC Tumor Cell Identification using Inference of CNVs
CNV profiles of the tumor and control samples were obtained across the four CNV inference tools. InferCNV and sciCNV do not directly identify malignant cells, but rather detect alterations in chromosomal copy numbers that assumed to be indicative of malignancy. This process requires the designation of a subset of cells as ‘normal’ cells by the user, which will be used as references for comparison and computation of the cell CNV scores. On the other hand, CopyKAT and SCEVAN use gene expression data to look for the presence of aneuploidy. The rationale behind their prediction of tumor/normal cell states is that aneuploidy is a common occurrence in human cancers (90%). Immune cells and stromal normal cells usually have diploid (2N) or nearly-diploid copy number profiles, whereas cells exhibiting widespread copy number aberrations throughout the genome (aneuploidy) are classified as tumor cells [
13,
34,
62,
63].
3.3.1. sciCNV
sciCNV returns a score for each cell by comparing the CNV profiles of single cells to the average CNV profile of test cells. The authors of sciCNV state that the tumor CNV score enables the detection of malignant cells from normal diploid cells that harbor the CNVs specific to the tumor [
33], yet no further guidance or criteria on how to accomplish this is described. We therefore retrieved the tool’s CNV score output file (
TotScore) for each sample and visualized them in density plots (
Figure 1a-h). However, since no separate distributions were highlighted, we were unable to distinguish the normal cells from the tumor cells based on the CNV scores. We further plotted the tumor cells predicted by markers in the sciCNV score density plots for each sample and observed that these were distributed uniformly throughout the CNV score density plots (
Figure 1a–h), therefore we were unable to establish clear criteria of distinguishing the tumor cells from normal cells.
3.3.2. InferCNV
CNV profiles of the tumor and control samples were visualized in heatmaps (
Supplementary Figure S2). The amplifications observed in chromosome 6 of the reference heatmaps were expected and due to the overexpression of the major histocompatibility complex (MHC) genes characteristic of immune cells used as the reference cells in the analysis. These genes encode proteins that play a crucial role in presenting antigens derived from pathogens to the immune system of the host, and studies have suggested that they exhibit significant variation in copy numbers both within and between species [
64]. We calculated the CNV score of each cell as the quadratic sum of each CNV region and visualized these as density plots (
Figure 3). For most of the samples we observed a bimodal distribution which allowed us to set a threshold between the two peaks by hypothesizing that the profiles with the lowest and highest score identified normal and tumor cells, respectively. However, we also imposed that the cells to be considered as cancerous must have a CNV score greater than 12. We deduced this threshold from the observation of the CNV scores of bimodal distributions as well as of the cells derived from the normal healthy sample (Normal_N1).
3.3.3. CopyKAT
Heatmaps displaying the CNV profiles across the chromosomes for each sample were generated through the CopyKAT pipeline (
Supplementary Figure S3). The cells were classified as either aneuploid or diploid in the CopyKAT output and considered as cancer and normal cells respectively, as suggested by the authors of the software.
3.3.4. SCEVAN
In the SCEVAN output file, cells were defined as either normal or tumor cells, and CNV profiles of the predictions as well as subclones of the predicted tumor cells were visualized in the form of heatmaps (
Supplementary Figure S4). We further noted that the frequency of amplifications exceeded that of deletions for all samples, which is in agreement with previous reports [
65] and may be a result of selection on deletions [
66].
3.4. Comparisons Among the Methods of Inferring CNVs
In
Table 4 we show comparisons among the number of epithelial cells, cancer cells identified by markers, their overlap, and tumor cells predicted by three CNV based prediction tools in each sample. Here we have chosen to consider the tumor cells identified by a marker-based method as the positive reference against which to compare the results of the CNV inference algorithms.
Since it is known that PDAC cells are of the epithelial type, we expected that the number of tumor cells obtained through the markers or predicted through the increased presence of CNVs does not exceed that of epithelial cells. The reliability of these predictions was confirmed by the nearly complete overlap observed between the epithelial cells and tumor cells predicted by markers, that is, almost all predicted tumor cells are epithelial cells and majority of the epithelial cells are tumor cells.
sciCNV predictions were excluded from this comparison because it was not possible to identify a criteria or threshold (as can be observed from the distribution in
Figure 1) beyond which to consider the cells as tumorous.
Regarding the predictions by CNV inference tools, InferCNV often exceeds the number of epithelial cells in PDAC patients whereas CopyKAT and SCEVAN sometimes return lower numbers suggesting a low specificity for InferCNV and a low sensibility for CopyKAT and SCEVAN.
In the normal adjacent tissue samples, assuming that cells with a transcriptomic profile that resembles that of a tumor cell are present, it can be observed that in AdjNorm_1 and AdjNorm_2 the tools predict a number of tumor cells that are more than and less than that of epithelial cells respectively, thus highlighting a strong dependence of the predictions on the samples.
However, it is surprising that CopyKAT and SCEVAN tools indicate the presence of tumor cells in the completely normal sample (Normal_N1).
It is important to note that in
Table 4, the intersections between biomarker predictions and those obtained through CNV inference tools have not been taken into consideration. For example, in PDAC_1 patient we cannot deduce whether the 146 tumor cells identified by markers are included in the 979 cells predicted via InferCNV. This type of analysis, that is, the overlap among predictions is shown by Venn diagrams in
Figure 2a–h.
In the PDAC_1 sample, most tumor cells are predicted by InferCNV and SCEVAN tools, but InferCNV and CopyKAT return the highest number false positives. For the PDAC_2 sample, all tools detect most of tumor cells (733, or 70.95%) and InferCNV is the software with the most false positives. In PDAC_3, InferCNV and CopyKAT predict most of the tumor cells. In PDAC_4 and PDAC_5, InferCNV predicts 94.1% (2640) and 81.9% (1881) of the cancer cells respectively. In the AdjNorm_1 and AdjNorm_2 samples we observe the highest number of false positives predicted by CopyKAT and SCEVAN. In the normal pancreas sample Normal_N1, InferCNV does not identify any tumor cells, unlike CopyKAT and SCEVAN.
In Supplementary
Tables S2–S4, we have reported data of the diagrams of
Figure 2 and detailed calculations of the performance of the tools using the cells identified by markers as reference tumor cells. Performances of the three tools are summarized in
Table 5. The objective was that the tumor cells predicted by the three CNV inference tools be similar to those identified by markers, even in the AdjNorm samples.
Notably, the performances of the software are strongly sample-dependent, with InferCNV and SCEVAN being the most sensitive and the most specific tools respectively. However, almost all the software seem to lack sensitivity in the samples adjacent to the tumor tissue, but it is worth considering that these tools could have potentially identified the real tumor cells among those with an expression profile similar to that of the tumor. If so, the three prediction tools would allow us to filter the cancer cells obtained by the markers.
3.5. Comparison between Tumor Cells Predicted by InferCNV and Markers
At this point, we wanted to verify whether the low specificity of InferCNV was due to our choice of threshold of the computed CNV score as well as whether the tumor cells identified by the markers are those with the highest CNV score, and hence confirm the assumption that CNVs are a hallmark of all cancer. InferCNV returns a score for each gene and the total scores of a cell can be calculated as the quadratic sum of the scores [
67]. We show the distribution of the CNV scores of each cell in each of our samples in
Figure 3a–h, distinguishing tumor cells identified by markers (yellow color) compared to the total number of cells (blue color). In the PDAC samples (
Figure 3a–e), we observe all tumor cells identified by markers to be generally above the threshold (≥12) apart from PDAC_5 (Met) that present cancer cells even below it (<12). Hence the establishment of a cut-off between the two peaks in the bimodal distribution enables the complete identification of tumor cells, although this may also result in the inclusion of many normal cells. For example, the PDAC_1 sample has many normal cells (verified to be mostly immune cells) with the same CNV scores as the tumor cells therefore indicating a high number of false positives by InferCNV. The AdjNorm_1 sample also presents a bimodal distribution with the CNV score cut-off at around 12 (
Figure 3f) suggesting that beyond this threshold there are tumor cells, but this assumption is contradicted by the markers. Further analysis revealed that many of these false positives were immune cells therefore suggesting that the cells predicted to be tumorous by biomarkers may be in an early stage of disease and hence have a low number of CNVs, or it may proof that they should not be considered as tumor cells because they have fewer CNVs compared to the normal cells. As expected, cells taken from the normal pancreas tissue has a lower CNV score than those from PDAC patients (
Figure 3h).
Figure 3.
Density plots showing the distribution of sums of the quadratic CNV scores returned by InferCNV for primary (a-c) and metastatic (d-e) PDAC, normal adjacent (f-g) and normal pancreas (h) samples. The CNV score calculated as quadratic sum of CNV region is displayed on the x-axis.
Figure 3.
Density plots showing the distribution of sums of the quadratic CNV scores returned by InferCNV for primary (a-c) and metastatic (d-e) PDAC, normal adjacent (f-g) and normal pancreas (h) samples. The CNV score calculated as quadratic sum of CNV region is displayed on the x-axis.
4. Discussion
A In this work we compared different tools that analyse single cell RNA-seq data in order to facilitate the identification of tumor cells, based on the fact that CNVs are a hallmark of all cancers and therefore these cells should present a higher number of cancer- associated CNVs compared to other cells.
Tumor cells can also be identified by the presence of specific gene markers. However, even though several of these cancer biomarkers are known, there are two main challenges in their use. On one hand, certain gene markers may be patient-specific (i.e., derived from subgroups of patients). Consequently, the exclusion/inclusion relationships indicating which markers are indispensable, interchangeable or mutually exclusive to define a cancer cell remain unknown. In other words, the markers that constitute a necessary but not sufficient condition to define a tumor cell are not known.
On the other hand, the single cell RNAseq technique does not yet guarantee a good sequencing depth so some cells may not be recognized as tumor because the expression of some of their markers may not have been detected [
68]. For these reasons, the authors who analyse single cell data group cells in clusters and base the cancer cell detection not only on the expression of biomarkers but also on the presence of CNVs. To accomplish the latter, tools exist that process expression data to infer CNVs, and from these, predict which cells are tumorous. Using scRNA-seq data from primary and metastatic PDAC samples, healthy and adjacent healthy tissues, we show a poor concordance of predictions among these tools, and a strong sample-dependence. In addition, each of the algorithms we evaluated has its own strengths and weaknesses.
From our comparisons, it emerged that InferCNV algorithm has the highest sensitivity. However, significant drawbacks of InferCNV include its i) inability to identify precise coordinates of chromosome breakpoints or copy number segments, as it can only provide smoothed averages of gene windows; ii) manual choice and input of reference cells [
69,
70] which can influence the results; iii) long analysis time (days) even on a computer with an i9 13th gen processor.
The CopyKAT method, which classifies non-malignant and malignant cells automatically, represents a way to get beyond these restrictions. Unfortunately, it showed poor sensitivity and specificity in our comparisons. This was especially evident in the PDAC_1 and Adj_Norm_1 samples in which ~80% of the predicted tumor cells were the immune cells and not epithelial cells as expected in PDAC. As already known, these problems could be because in datasets with few or no tumor cells, CopyKAT mistakenly identified CNV events in cells exhibiting the highest gene expression levels. Nevertheless, the authors of CopyKAT acknowledge this limitation and suggest that the implied CNV occurrences can be disregarded in such instances [
13].
SCEVAN was the fastest among all the four software but is based mainly on the search of aneuploidy [
34] and was unable to analyse samples with <1100 cells. SCEVAN performs a stringent filtration to work with high quality data, and its authors claim that it works well in samples with a significant amount of immune infiltration. It is therefore particularly suitable for studies that involve the analysis of heterogeneous cell populations, for example to gain insights into the interaction between malignant cells and their microenvironment [
71]. Although our data meets these recommendations, SCEVAN appears to have low sensitivity and medium specificity.
We should take into account that the identification of tumor cells based solely on inferring CNVs (without the use markers) relies on the presence of genetically unstable cells and might therefore not be able to detect all cancer cells since some of them may have nearly diploid genomes or minimal variations in their genomic structure [
12,
13]. As a result, most studies claim to use a combination of both marker-based methods and inference of CNVs to provide more accurate results, even though the specific criteria used remain unclear.
Nevertheless, it is important to acknowledge certain limitations in this study. First, we considered the tumor cells identified by markers as the benchmark for evaluating the results of the CNV inference software and therefore may not be completely accurate. In fact, there is lack of knowledge on how to apply tumor markers, and the problem of defining an identikit of a tumor cell is also present in pathological anatomy where many PDAC markers are available: PTX3, LGALS9, ENO1, REG4, POSTN, CA242, LGALS1, SERPINB5, pVHL, CA125, MUC5AC, THBS2, LTBP2, CPA4, IMP3, CD13, DKK1, KOC, S100P, MSLN, MUC1, MUC4, PAM4, CA19-9, GPC-1, ANXA10, CLDN18, KRT19, KRT7, KRT17, CEA, CLDN4 [
72,
73,
74]. Here too the correct use of markers is debatable because each of the protein markers is characterized by its own sensitivity and specificity, and some markers (such as CA19-9) have recently proven to be less specific than we thought [
74]. Second, the false positives and negatives that we have highlighted in some cases could be solely due to the parameters used in processing, but the users of these tools are not guided in choosing among the many possible parameter combinations, each of which provide different results. In addition, these settings should also take into account the total number of cells in a sample. Third, some tools, such as InferCNV and sciCNV, do not directly show which cell are cancerous but rather return the number of CNVs without indicating a threshold beyond which to consider a cell as tumorous. In these cases, one can hope to identify a bimodal distribution of CNV scores and establish a threshold that distinguishes the two cell populations. Fourth, although there is a formula suggested by the authors of InferCNV to process the CNV scores of each cell from InferCNV output, different formulae could be evaluated and used to increase the accuracy of tumor cell identification. Furthermore, formulas that associate a greater weight with the genes involved in cancer onset, development and progression should be evaluated. Lastly, the users of the software are not guided in choice of the best approach for accurate identification of tumor cells, that is, whether to intersect the predictions of biomarkers with those of CNVs or to use the latter to expand those found with markers alone. In other words, whether CNVs have to be used to increase the specificity of biomarker predictions or increase their sensitivity.
In conclusion, the accurate and sensitive detection of CNVs in scRNA-seq data remains challenging despite the many available algorithms, and therefore the development of more accurate tools is still needed. Programs that integrate the prediction of CNVs with the presence of mutations would also be of importance because the amplification of an oncogene has greater effect than the amplification of a proto-oncogene. Similarly, the loss of one copy of a tumor suppressor gene is much more serious if the remaining copy is mutated so it is no longer functional (the two-hit hypothesis).