1. Introduction
Regulatory networks govern cellular diversity and responses, making them crucial to functionally characterize the pathways that influence cell fate and behaviour for developing advanced targeted and combination therapies (Jaitin et al., 2016; Replogle et al., 2022). However, analysing cell-type-specific changes in complex organs and differentiation hierarchies is challenging due to the intricate nature of biological circuits, which involve redundancies, nonlinear interactions between pathways, and variability in cellular plasticity and heterogeneity in both in vivo and in vitro models. There is a need for genome-wide approaches that can identify interactions between genetic elements at the single-cell level (Jaitin et al., 2016; Datlinger et al., 2017).
In cancer research and clinical practice, identifying driver genes and pathways is essential (Beroukhim et al., 2010; Martínez-Jiménez et al., 2020; Bailey et al., 2018). A gene expression change induced by a cancer variant compared to the control or wild type (WT) allele can hint at its biological role but does not directly indicate physiological outcomes like tumour development or drug response (Ursu et al., 2022). Frequency-based methods in genome-wide screens have been effective in identifying potential driver genes by finding those with unexpectedly high mutation rates during cancer progression (Wood et al., 2007; Jones et al., 2008; Parsons et al., 2008; Ding et al., 2008; Cancer Genome Atlas Research Network, 2008). This framework has evolved with Perturb-seq, a novel high-throughput method that allows for multi-locus gene perturbation with single-cell transcriptome profiling (Dixit et al., 2016; Adamson et al., 2016; Jaitin et al., 2016; Replogle et al., 2022).
Yet, the resulting list of potential driver genes often lacks a unified biological background linking these genes to specific phenotypes and identifying alternative targets of interest (Wu et al., 2010; Yang et al., 2021). To overcome these limitations, new network-based methods have been developed that combine genomics data with protein–protein interaction (PPi) network topology (Yang et al., 2023; Leiserson et al., 2015; Levi et al., 2021; Chitra et al., 2022).
For instance, numerous mutations in TP53 and KRAS have been discovered in lung cancer. Since these proteins influence various cellular pathways, it's important to understand not just the function of each mutation, but also how it might impact specific pathways. However, the underlying cellular and molecular mechanisms remain unclear (Canale et al., 2022; Berger et al., 2016). Existing methods vary, leading to inconsistent results, and mainly focus on static interaction networks. Grouping variants with similar effects could reduce computational costs, and creating an expression-based atlas of cancer variant impacts could serve as an initial framework for annotating allele functions (Ursu et al., 2022). Iterative analysis can also improve the accuracy of results by prioritizing and validating significant interactions and addressing issues with noisy datasets that cause false positives (Replogle et al., 2022; Wu et al., 2010).
This project presents a comprehensive and efficient pipeline that compares and integrates various bioinformatics tools and methodologies, based on extensive literature review, to successfully distinguish impactful variants, identify variants with similar expression profiles, and retrieve key affected modules, such as those observed in KRAS and TP53 variants (
Figure 1). The pipeline begins with the high-throughput counts matrix obtained from any Perturb-seq experiment. The data is then processed to remove low-quality genes and cells, followed by normalization of the counts. Impactful variants are identified using the Hotelling T² statistic, while similar variants are grouped by Spearman correlation. Genes from each variant cluster are input into STRING to create a PPi network in Cytoscape, enabling the recovery of interconnected nodes (
Szklarczyk et al., 2019; Shannon et al., 2003). These nodes are further annotated with log fold changes (LFCs) and p-values via differential expression analysis. This framework serves as the input for an implementation of the Hierarchical HotNet algorithm, which addresses the limitations of current network-propagation methods (
Reyna et al., 2018). Detected altered subnetworks are divided into biologically meaningful modules using the MCODE tool in Cytoscape and annotated in Reactome, so interactions are cross-validated too (
Bader et al., 2003; Vastrik et al., 2007). All code for replicating the analysis, along with documentation on pipeline usage, is available at
https://github.com/jesusch10/perturbseq_analysis.
1.1. Perturb-seq as standard strategy
Bulk RNA sequencing (bulk RNA-seq) and single-cell RNA sequencing (scRNA-seq) are both methods widely used to analyse gene expression profiles in cells. Unlike bulk RNA-seq, which captures the average gene expression across a population of heterogenous cells masking subtle and time-dependent transcriptional differences, scRNA-seq allows to study the transcriptome of individual cells identifying key regulators (Yu et al., 2021). However, this technology is inherently descriptive and does not establish causality unless combined with perturbation models. When integrated with such models, genetic interactions can reveal functional relationships between genes and pathways in different cellular contexts, though this is constrained by the number of single cells that can be sequenced (Jaitin et al., 2016).
Genetic screens systematically assess gene function in eukaryotic cells and can be designed in two main formats: (1) an "arrayed" format in which each perturbation is applied and analysed separately to interpret the combined nonlinear effects of multiple factors, or (2) a "pooled" format, which is more efficient and scalable and allows for the examination of higher-order interactions but is limited to simpler readouts (Dixit et al., 2016).
Pooled genetic screens are effective tools for discovering mechanisms affecting cell survival, proliferation, drug resistance, and viral infection. However, they rely on basic phenotypic readouts that average population properties and do not distinguish between different perturbations causing similar responses or account for bulk phenotypes driven by diverse cell subpopulations (Adamson et al., 2016; Jaitin et al., 2016). Additionally, as CRISPR/Cas9 is often used for pooled screens, cells must be physically separated to receive individual gRNAs, limiting the resolution needed for transcriptome profiling and understanding complex phenotypes (Jaitin et al., 2016; Datlinger et al., 2017; Replogle et al., 2022). Mapping genetic changes to their phenotypic consequences traditionally involves either a (1) phenotype-centric or "forward genetic" approach which identifies genetic changes driving specific phenotypes, or (2) a gene-centric or "reverse genetic" approach, which catalogues phenotypes resulting from specific genetic changes. Both approaches have limitations, with forward screens typically using simple phenotypes that may conflate different mechanisms and reverse screens being limited to chosen targets limiting systematic comparisons (Replogle et al., 2022).
To address these challenges, a novel high-throughput method called Perturb-seq has been developed. This technique combines CRISPR/Cas9 for multi-locus gene perturbation with high efficacy and specificity, and massively parallel droplet-based scRNA-seq with high resolution. Perturb-seq generates comprehensive genotype-phenotype maps by profiling hundreds of thousands of separately perturbed cells, providing detailed phenotypic readouts (Dixit et al., 2016; Adamson et al., 2016; Jaitin et al., 2016; Replogle et al., 2022).
Perturb-seq enables the analysis in vitro and in vivo of complex phenotypes from large numbers of perturbations and combinations, reducing the time and cost associated with transcriptome assays without needing complex reagents (Dixit et al., 2016; Jaitin et al., 2016; Datlinger et al., 2017; Replogle et al., 2022). By extending this method to other high-dimensional molecular phenotypes or diverse cell metadata, the same experiment can reveal genetic interactions and contributions of PPi (Dixit et al., 2016). Many well-characterized human genes serve as controls to help interpret comprehensive datasets (Replogle et al., 2022).
The Perturb-seq pipeline begins with a robust triple cell barcoding strategy that encodes CRISPR-mediated perturbations in expressed transcripts captured during scRNA-seq. This involves two indices included in cDNA molecules: (1) a cell barcode (CBC) to label all mRNA from a specific cell, and (2) a unique molecular identifier (UMI) for molecular counting of captured mRNA correcting for PCR duplicates (Adamson et al., 2016). The third index, a guide barcode (GBC), is a synthetic transcript usually transduced into cells by lentiviruses to identify the specific perturbation, such as a Cas9-targeting single guide RNA (sgRNA); UMI usually goes in the same viral vector along with Cas9 (Adamson et al., 2016; Jaitin et al., 2016; Datlinger et al., 2017).
To manage sequencing costs, methods like target amplification or depletion of high-abundance genes can be used. As technology matures, costs per cell are decreasing, but analysing the vast output of noisy data remains challenging. This data needs to address various aspects such as differential gene expression, RNA splicing, TE expression, differentiation, transcriptional heterogeneity, cell-cycle progression, and chromosomal instability (Adamson et al., 2016; Replogle et al., 2022). Moreover, for detecting alleles with strong effects, or possibly alleles with smaller effect sizes by grouping similar variants, a calculation of 20 cells per variant is enough (Ursu et al., 2022).
Perturb-seq is not restricted to coding genes or gene knockdown but can also perturb non-coding RNAs, promoters, and enhancers (Jaitin et al., 2016). On the other hand, Perturb-seq is compatible with various CRISPR-based perturbations, including CRISPR interference (CRISPRi), which allows for sequence-specific repression of gene expression in both prokaryotic and eukaryotic cells (Replogle et al., 2022; Jensen et al., 2021). Studies applying CRISPRi have successfully produced loss-of-function perturbations, though as described the data generated on different transcriptional phenotypes is limited (Adamson et al., 2016; Replogle et al., 2022; Wu et al., 2022; Tian et al., 2019).
1.2. Resolving affected biological pathways by interpreting high-throughput data
Emerging functional genomic tools are focused on analysing biochemical events that drive cellular processes, crucial for understanding the progression of various cancer types (Zheng et al., 2023; Vogelstein et al., 2004). Different cancer types can originate through multiple routes (Vogelstein et al., 2004). Yet, large-scale genome-wide screening projects have identified common core signalling pathways involved in cancer etiology (Wood et al., 2007; Jones et al., 2008; Parsons et al., 2008; Ding et al., 2008; Cancer Genome Atlas Research Network, 2008). By mapping mutated, amplified, or deleted genes onto biological pathways, researchers can identify statistically significant clusters of otherwise unrelated genes, revealing the biological pathways affected by these genetic alterations (Wu et al., 2010).
Gene regulatory networks (GRNs) illustrate the functional regulatory interactions between DNA-binding transcription factors (TFs) and their target genes (Yachie-Kinoshita et al., 2019). Reconstructing GRNs requires combining various experimental observations, such as spatial and temporal gene expression data (Keat Wei et al., 2017; Babichev et al., 2019). Many tools have been developed for GRN reconstruction (Dixit et al., 2016; Aibar et al., 2017; Margolin et al., 2006). However, these tools often treat TFs as uniform features and do not consider specific characteristics like DNA-binding domains, protein domains, structure, or subcellular localization (Holland et al., 2020; Hecker et al., 2023).
Reconstructing GRNs has significant challenges, including the need for extensive computational resources, time-consuming data processing, and the complexity of interpreting the resulting networks (Babichev et al., 2019). Network inference algorithms perform better with data from 'structural' perturbations, but they still lack robust causal inference capabilities after 20 years of research (Soranzo et al., 2007) (Saint-Antoine et al., 2023; Kim et al., 2003). One reason is that single-cell RNA-seq techniques may not yet provide sufficient resolution and expression variation for reliable GRN inference, despite rapid advances in the number of cells measured and the depth of coverage (Pratapa et al., 2020). Consequently, available GRNs likely represent only a small fraction of all in vivo interactions (Walhout, 2006).
Protein-protein interaction (PPi) networks are other static models that help to understand molecular and cellular mechanisms controlling physiology and progression of diseases (Davidson et al., 2002; Safari-Alighiarloo et al., 2014). Analysing amino acid substitutions that create deleterious phenotypes can reveal active parts of the network and their biological consequences (Ozturk et al., 2022). This information is valuable for developing effective diagnostic and therapeutic strategies (Davidson et al., 2002). The scale-free topology of PPi networks suggests a higher tolerance to random failures, making them more robust than GRNs for studying variants affecting high-degree nodes. In fact, there are much more PPi structurally resolved available than GRNs, ideal for the purpose of this project (Ozturk et al., 2022).
2. Methods
2.1. Data and filtering
Specialized software is used to demultiplex, filter, and align sequencing reads to the corresponding reference transcriptome and the reference gRNA library (Meyers et al., 2023; Zhou et al., 2011; Spisak et al., 2024; Schnitzler et al., 2024). Cell Ranger 3.0 is commonly employed for these tasks (Dixit et al., 2016; Adamson et al., 2016; Replogle et al., 2022; Ursu et al., 2022; Otto et al., 2023; Pacalin et al., 2024; Zheng et al., 2023; Butterworth et al., 2023; Ozturk et al., 2023; Jin et al., 2020). There are various methods to assign cells to variants, such as using at least one UMI supporting the variant, thresholding UMI counts, or selecting the greatest UMI counts (Ursu et al., 2022; Barry et al., 2024; Papalexi et al., 2021; Schraivogel et al., 2020).
Given the difficulty of dealing with the many ways to conduct a Perturb-seq experiment, efforts should focus on the resulting common matrix of UMI counts, with CBC as rows and gene identities as columns. The analytical pipeline purposed here aims to parse and decompose the noisy count matrix generated by Perturb-seq, which contains RNA-seq profiles of tens of thousands of individual cells, into interpretable components that separate the responses to a given perturbation at the resolution of individual cells (Adamson et al., 2016).
The raw counts used in this project are publicly accessible on Gene Expression Omnibus (accession number GSE161824 (Barrett et al., 2013). This data includes expression profiling by high throughput sequencing of 300,000 single lung cancer cells, with 98 and 99 variants, including the wild type (WT) for TP53 and KRAS experiments, respectively, already assigned (Ursu et al., 2022).
Initially, raw data is visually inspected, plotting the total number of counts per cell (UMIs) against the number of genes with non-zero counts to then set 7,000 and 50,000 as UMI range. Cells with unusually high sequencing depth are downsampled, while low-quality cells with counts below the minimum threshold are filtered out. Batch effect correction is performed by downsampling channels by a factor calculated as the UMI median of a batch divided by 20,000, the desired UMI median for sequencing depth across batches. Channels with a UMI median less than the desired value are not adjusted (Ursu et al., 2022).
As a default parameter, cells with more than 20% mitochondrial genes counts are filtered out, as high mitochondrial gene expression indicates stressed or dying cells
(Griffiths et al., 2021; Sunshine et al., 2023; Lotfollahi et al., 2023; Otto et al., 2023). For this purpose, the MitoCarta3.0 database provides updated mitochondrial gene symbols
(Rath et al., 2021). Using the Scanpy toolkit, cells with fewer than 200 non-zero count genes and genes expressed in less than 5% of cells are also filtered (
Figure 2)
(Wolf et al., 2018; Littman et al., 2023; Otto et al., 2023; Pacalin et al., 2024). Data normalization follows, adjusting each cell's total count to 10,000 and applying a natural logarithm transformation to the data matrix
(Ursu et al., 2022; Lotfollahi et al., 2023; Otto et al., 2023).
To select the most informative genes, an additional filtering step selects the most highly variable genes. Thresholds are set at 0.0125 to 4 for mean gene expression, and 0.4 for the minimum gene dispersion (variability in expression across cells)
(Zheng et al., 2017). The data matrix is then scaled to unit variance and zero mean (z-scores), and Principal Component Analysis (PCA) is used for dimensionality reduction, keeping the first 50 principal components (PCs) (
Figure 3)
(Replogle et al., 2022; Ursu et al., 2022).
2.2. Comparing variants
To quantify the deviation between each perturbation and the WT, Hotelling’s two-sample T-squared statistic (T²), a multivariate extension of the t-test for two-sample hypothesis testing, can be used in the PC space, implemented with the spm1d Python package (Hotelling, 1992) (Pataky, 2012). Hotelling T² requires at least as many PCs as there are cells
per variant, at least 20 cells per variant as mentioned, so the top 20 PCs are used to compare matrices of cells of 20 PCs for each variant (Ursu et al., 2022). This quantifies the extent to which a variant's expression profile deviates from the WT: higher scores indicate a higher impact (likely a loss-of-function), while lower scores suggest 'WT-like' perturbations (possible gain-of-function) (Butterworth et al., 2023; Ozturk et al., 2023). Non-functional variant positions are under positive selective pressure in tumours and are frequently mutated across patients (Ursu et al., 2022).
To identify 'impactful' variants that significantly change their expression profile, a null distribution of Hotelling T² statistic scores is derived by permuting the count matrix (Ursu et al., 2022). For a 95% confidence interval, the score threshold for a desired False Discovery Rate (FDR) is computed, yielding thresholds of 44.42 and 41.55 for the TP53 and KRAS experiments, respectively.
Additionally, related multiplexed perturbations with common transcriptional patterns can be clustered to provide a high-resolution characterization of altered cell fates (common phenotypic fingerprints)
(Adamson et al., 2016; Sunshine et al., 2023; Meyers et al., 2023; Zheng et al., 2023). Spearman correlation coefficients between the average expression profiles of each pair of variants are calculated, then grouping cells by similar GBC, and followed by unsupervised hierarchical clustering of the correlation matrix using L1 distance and complete linkage
(Adamson et al., 2016; Replogle et al., 2022; Butterworth et al., 2023). Spearman correlation, unlike Pearson correlation, captures monotonic relationships without assuming a specific data distribution
(Sedgwick, 2014). The hierarchy is visually inspected and cut to obtain discrete cluster assignments (
Figure 8 and 10).
2.3. Leiden clustering
To visualize single-cell datasets, cells are often represented in a low-dimensional space using uniform manifold approximation and projection (UMAP), typically with 15 nearest neighbours per cell and a minimum distance of 0.5 between embedded points (Hein et al., 2022; Yao et al., 2023; McInnes et al., 2018). Subsequently, cells from the nearest neighbour graph are clustered using algorithms like the Louvain algorithm, which is widely used for uncovering community structures (Ursu et al., 2022; Blondel et al., 2008).
However, recently the Leiden algorithm has been shown to be faster and to uncover better partitions than the Louvain algorithm
(Traag et al., 2019). It has been systematically applied to summarize genotype-phenotype relationships by clustering genes into expression programs based on their co-regulation
(Replogle et al., 2022; Littman et al., 2023; Hein et al., 2022; Yao et al., 2023). In addition, it is possible to identify sets of genes with coherent average expression profiles across variants, known as gene programs (GPs), by correlating the average profiles of genes across variants (
Figure 16). Using the gProfiler toolkit in Python, an enrichment analysis of functional terms can be performed, focusing on the KEGG Pathway Database, which includes manually drawn pathway maps representing molecular interaction, reaction, and relation networks
(Kolberg et al., 2023) (Kanehisa et al., 2017).
2.4. Differential expression análisis (DEA)
In comparative high-throughput sequencing assays, a fundamental task is analysing read counts per gene in RNA-seq to identify systematic changes across experimental conditions (Love et al., 2014; Muzellec et al., 2023; Muzellec et al., 2023). This process helps distinguish genes associated with diseases or other phenotypes from false positive hits or genes affected by the general instability of the malignant genome (Gortzak-Uzan et al., 2008). RNA-seq data, however, are often overdispersed from the assumed Poisson distributions, making it essential to focus on comparing expression values across different samples rather than absolute expression levels (Zhou et al., 2011) (Robinson et al., 2010).
DESeq2 is a method designed for quantitative analysis of differential expression, emphasizing the strength of differential expression through shrinkage estimation for dispersions and log fold changes (LFCs). This approach improves stability, reproducibility, and interpretability compared to maximum-likelihood-based solutions (Love et al., 2014). DESeq2 has been widely used to identify differentially expressed genes in various studies (Zheng et al., 2023; Ozturk et al., 2023; Wu et al., 2022; Jin et al., 2020; Tian et al., 2019).
In this context, PyDESeq2, a Python implementation of the DESeq2 workflow, is applied here for differential expression analysis (DEA). PyDESeq2 offers higher model likelihood and speed improvements for large datasets. It computes p-values using Wald tests and adjusts for multiple testing through the Benjamini-Hochberg procedure (Muzellec et al., 2023). Results are further processed here in two dataframes: one containing LFC values for each gene per variant and another with q-values for assessing the significance of genes using a 0.05 alpha threshold.
2.5. The STRING database
The STRING database collects, scores, and integrates all publicly available sources of PPi information, supplemented with computational predictions. These interactions, despite varying in specificity and strength, represent essential biological interfaces crucial for functional systems (Szklarczyk et al., 2019). The query network can be layout in Cytoscape, an open-source software project that integrates, models, and analyses biomolecular interaction networks, incorporating high-throughput expression data and other molecular states into a unified conceptual framework (Cline et al., 2007; Shannon et al., 2003).
To facilitate this task, a STRING app for Cytoscape enables users to easily retrieve, visualize, and analyse networks of hundreds to thousands of proteins through a graphical user interface. This app also stores the exact query term used to find the protein, facilitating easy linkage back to the original data (Szklarczyk et al., 2019). The network retrieved from STRING is an undirected model, meaning no distinctions are made between the nodes, and all edges (connections between nodes) are bidirectional (Borroto-Escuela et al., 2014; Saki et al., 2021).
In this context, after filtering, the remaining most variable genes are passed to the STRING query in Cytoscape with a score cutoff of 0.8 (80% confidence) and 0 as the maximum additional interactors. These parameters provide a narrowed high-confidence network easy to explore (Doncheva et al., 2023). From this network, the nodes directly connected to the main network, including the perturbed gene, are selected to create a separate subnetwork. This highly interconnected and enriched subnetwork includes all downstream signalling components and various protein complexes and functional categories, serving as the basis for further analysis (Belk et al., 2022).
While STRING offers higher coverage by containing collections of pairwise relationships among proteins and genes, it does not necessarily indicate a biologically functional relationship between interacting genes or proteins (Wu et al., 2010). Therefore, results from STRING must be curated and validated through pathway projects to clean the data and trainpredictive models effectively.
2.6. Adaptation of Hierarchical Hotnet
Network-propagation methods are crucial for identifying functional changes, predicting drug targets, and exploring drug synergy in cancers (Liu et al., 2020). These methods operate on an undirected graph to identify significant interactions common among variants within the same cluster.
Given a network or graph with vertices, edges and scores/weights on the vertices, many methods address the identification of altered subnetworks. These methods fall into three broad categories: (1) those that define a restricted class of candidate subnetworks and select high-weight subnetworks based on vertex scores (Ideker et al., 2002; Beisser et al., 2010); (2) those that jointly examine vertex scores and network topology to identify subnetworks not apparent from scores or topology alone (Cowen et al., 2017; Chung et al., 2010; Cho et al., 2015; Vanunu et al., 2010; Vandin et al., 2012; Leiserson et al., 2015); (3) and those that incorporate additional information like predefined pathways (Vaske et al., 2010; Dao et al., 2012; Dao et al., 2017; Ciriello et al., 2012; Shrestha et al., 2017).
Even with numerous promising methods available, retrieving accurately disease mechanisms continues to be a challenge. Moreover, many of them are no longer updated with novel methods, and their stated efficacy is questioned when applying cross-validation frameworks (Picart-Armada et al,. 2019). On the other side, studying oncogenic genes usually implies working with hub proteins (proteins that have a high number of interactions), such as TP53 and EGFR, which have numerous high-confidence connections that can introduce bias in heat diffusion models. These highly mutated genes, or "hot nodes," can transfer "heat" to neighbouring genes that may not be biologically relevant, skewing the results of the interactomes (Galindez et al., 2022; Demirel et al., 2022).
The Hierarchical HotNet (HH) has previously proven to predict the highest numbers of candidate cancer genes in comparison to other network-based methods addressing the limitations described (Dittrich et al., 2008; Horn et al., 2018; Cho et al., 2016) by (1) constructing a similarity matrix of each pair of vertices from the vertex-weighted graph G and topology using a random walk-based approach (a specific type of heat diffusion); (2) creating a dendrogram T of high δ that represents a hierarchy of clusters with a high degree of connectedness eliminating the clustering parameter selection (sensitive to the training dataset and computationally expensive to extract), (3) determining the statistical significance of clusters in T, using three null distributions (general randomization, weights randomization and topology randomization) that preserve different features, calculated as the ratio R of the size of the largest observed cluster at δ to the expected size of the largest cluster at δ based on the null distribution, and (4) performing a consensus between clusters from multiple networks reducing artifacts (Reyna et al., 2018).
Here, a Python implementation of HH has been developed for analysing any PPi network without edge weights. Using as basis the connected subnetwork obtained from previous steps, it is possible to create two distinct networks of same topology. Each one possesses a different set of scores derived from DEA, the and absolute LFC values, ensuring positive values for importance increments (Escala-Garcia et al., 2020; Shahamatdar et al., 2020).
For each variant, two sets of altered subnetworks (LFC and q-value types) are identified, and a final consensus is easily performed using NetworkX to integrate these subnetworks across variants within the same cluster
(Hagberg et al., 2008). NetworkX is also used to find the shortest paths between disconnected altered subnetworks, revealing key intermediary proteins that bridge separate regions and offering a general view of the affected pathways. As expected, it is shown the hub role of KRAS and TP53. However, the perturbed gene may be manually added if not be included by HH, which occurs when there is a loss-of-function without affecting its normal expression, so LFC and p-value remain invariant. In turn, the rest of nodes are annotated with the mean LFC calculated across variants of the same cluster in which it is reported significantly expressed with a certain frequency, at least twice here to not filter possible key nodes, but considering a reasonable coefficient of variation (CV) under 50% (
Figure 15,
Figure 16 and
Figure 17).
Figure 4.
Working flow of HH implementation and NetworkX for TP53 experiment.
Figure 4.
Working flow of HH implementation and NetworkX for TP53 experiment.
HH has been used to find significantly mutated modules related to certain cancer types (Escala-Garcia et al., 2020; Liu et al., 2020; Yepes et al., 2020). However, oncogenic genes change completely the cell expression profile, leading HH to reveal major altered subnetworks, which do not represent coherent functional units (Dittrich et al., 2008; Horn et al., 2018; Cho et al., 2016; Barel et al., 2020).
Figure 5.
Connected altered subnetworks of clusters 2 (top) and 3 (bottom) of variants with significantly expressed genes continuous mapped in red-blue, added intermediators in yellow, and perturbed TP53 in the red square.
Figure 5.
Connected altered subnetworks of clusters 2 (top) and 3 (bottom) of variants with significantly expressed genes continuous mapped in red-blue, added intermediators in yellow, and perturbed TP53 in the red square.
Figure 6.
Connected altered subnetworks of clusters 1 (first), 3 (second), 4 (third) and 5 (forth) of variants with significantly expressed genes continuous mapped in red-blue, added intermediators, and perturbed KRAS in the red square.
Figure 6.
Connected altered subnetworks of clusters 1 (first), 3 (second), 4 (third) and 5 (forth) of variants with significantly expressed genes continuous mapped in red-blue, added intermediators, and perturbed KRAS in the red square.
In this context, here it is purposed to apply MCODE in Cytoscape to detect clusters of nodes more densely connected to each other than to the rest of the network, so added intermediaries are maintained (Bader et al., 2003). Parameters used include a degree cutoff of 2, node connectivity score cutoff of 0.2, and exclusion of single-node connections within clusters. Resulting clusters (annex 1) with at least 3 as complex score are selected for further analysis in order to prioritize more confident findings.
Selected modules (eliminating added interactors) undergo pathway enrichment analysis using Reactome, a highly reliable knowledgebase of human biological pathways
(Vastrik et al., 2007). In other words, STRING interactions are cross-validated against Reactome pathways to ensure data reliability
(Wu et al., 2010). From all pathway annotations reported by Reactome (anx. 2-7), selected ones (
Figure 18) contain an extremely low p-value and FDR, with a reasonable number of entities and reactions found, and coherence with their relevance in lung cancer cells.
3. Results
3.1. Overview of data processing
After data processing and filtering, the proportion of cells and genes retained for the TP53 experiment is 82,963 out of 162,313 cells and 1,902 out of 24,867 genes, while for the KRAS experiment, it is 91,608 out of 150,043 cells and 3,314 out of 24,490 genes.
On top of the highest expressed genes, FTH1 and FTL, involved in iron metabolism, are prominent (
Figure 7). These genes, along with TF, TfR1, and Fpn, constitute ferritin, the primary intracellular iron storage protein
(Bogdan et al., 2016). Ferritin deregulation is observed in most acute myeloid leukemia (AML) patients, likely due to inflammation
(Bertoli et al., 2019). On the other hand, these genes are implicated in ferroptosis, a process that leads to lethal lipid peroxidation and non-apoptotic cell death driven by accumulated iron-dependent lipid Reactive Oxygen Species (ROS). While ferroptosis has shown importance in cancer therapeutics, its roles in tumorigenesis and development remain unclear
(Chen et al., 2020; Lu et al., 2018).
3.2. Characterization of variant profiles
In the TP53 experiment, variants are grouped into three defined classes as shown by the heatmap of the Spearman correlation, with impactful (loss-of-function) variants coloured in black (
Figure 8 and 9). Top Hotelling T² scoring variants include E286K, H193L, G245V, and G244C, all missense mutations with high prevalence in breast invasive ductal carcinoma and lung adenocarcinoma, according to the GENIE project of the AACR
(AACR Project GENIE Consortium, 2017).
In the KRAS experiments, variants show a continuum of gain-of-function phenotypes, whose functional impact cannot be inferred only by their frequency in patient cohorts as reported (
Figure 10 and 11)
(Ursu et al., 2022). Top Hotelling T² scoring variants include G13R, K117N, Q61L, and G13E, all missense mutations with high prevalence in lung adenocarcinoma and colon adenocarcinoma
(AACR Project GENIE Consortium, 2017).
3.3. Insights from Leiden clustering
When clustering perturbations based on transcriptional profiles, the Leiden algorithm faces challenges due to cellular plasticity and heterogeneity. Even when adjusting the resolution to produce an equal number of clusters as obtained by the Spearman hierarchy, cells carrying the same variant do not cluster together but rather distribute across clusters in similar proportions between variants (
Figure 12-15).
Conversely, for the 15 GPs obtained in TP53 experiment, GP 0 points out an alteration on spliceosome, GP 1 on the TNF signalling pathway, GP 2 on tyrosine kinase inhibitor resistance, GP 3 on proteasome, GP 8 on cell cycle, GP 9 on p53 signalling pathway, GP 11 on IL-17 signalling pathway, GP 12 on cell cycle, GP 13 on ribosome. On the other hand, for the 15 GPs obtained in KRAS experiment, GP 0 on spliceosome, GP 3 on adhesion, GP 4 on chromatin remodelling, GP 5 on p53 signalling pathway, GP 6 on TNF signalling pathways, GP 7 on IL-17 signalling pathway, GP 9 on proteasome, GP 10 on cell cycle, GP 13 on DNA replication.
3.4. DEA insights
LCFs and q-values dataframes, along with HotellingT2 scores, are particularly useful for determining canonical signatures, such as the underexpression of CDKN1A and RPS27L previously reported in TP53 experiments
(Jeay et al,. 2015). Notably, variants E286K, H193L, G245V, and G244C rank high in Hotelling T
2 scores for the TP53 experiment, with PyDESeq2 identifying 135, 147, 137, and 147 significantly expressed genes, respectively (
Figure 17). Despite other variants having more significantly expressed genes, these top variants likely induce larger changes (high LFC values) in a critical subset of genes, like CDKN1A and RPS27L. A similar pattern is observed in the KRAS experiment.
The integration of LFC and q-values is essential for identifying cancer driver genes and affected pathways, as the impact of a variant cannot be fully measured by the number of significantly expressed genes alone. The focus considered here should be on the magnitude of changes in critical genes to better understand and identify key drivers and pathways in cancer research.
3.4. Subnetworks identification and annotation
In the context of KRAS, despite variations among its clusters, these exhibit similar affected biological processes, reflecting the mentioned continuum of phenotypes. For instance, a common annotation previously reported is that oncogenic KRAS induces the loss of the redox master regulator Nfe2l2/Nrf2 (which regulates mRNA translation) to stimulate pancreatic and lung cancer initiation. Mutated TP53 also has been significantly altering Nfe2l2 pathways (Cancer Genome Atlas Research Network, 2008). Among the significantly expressed genes, HH includes in the altered subnetwork the Nrf2 target genes like NQO1 and HMOX1, as
Figure 8.
Hierarchical dendrogram based on Spearman correlation for TP53 experiment. Impactful variants are in black.
Figure 8.
Hierarchical dendrogram based on Spearman correlation for TP53 experiment. Impactful variants are in black.
Figure 9.
Heatmap of z-score of most variable genes for TP53 experiment.
Figure 9.
Heatmap of z-score of most variable genes for TP53 experiment.
Figure 10.
Hierarchical dendrogram based on Spearman correlation for KRAS experiment. Impactful variants are in black.
Figure 10.
Hierarchical dendrogram based on Spearman correlation for KRAS experiment. Impactful variants are in black.
Figure 11.
Heatmap of z-score of most variable genes for TP53 experiment.
Figure 11.
Heatmap of z-score of most variable genes for TP53 experiment.
Figure 12.
Leiden clusters of cells represented with UMAP (top) and the distribution of variants on the clusters (left) for KRAS experiment.
Figure 12.
Leiden clusters of cells represented with UMAP (top) and the distribution of variants on the clusters (left) for KRAS experiment.
Figure 13.
Leiden clusters of cells represented with UMAP (top) and the distribution of variants on the clusters (left) for TP53 experiment.
Figure 13.
Leiden clusters of cells represented with UMAP (top) and the distribution of variants on the clusters (left) for TP53 experiment.
Figure 14.
UMAP scatter of E286K variant against WT for TP53 experiment.
Figure 14.
UMAP scatter of E286K variant against WT for TP53 experiment.
Figure 15.
UMAP scatter of G13R variant against WT for KRAS experiment.
Figure 15.
UMAP scatter of G13R variant against WT for KRAS experiment.
Figure 16.
Leiden clusters of genes represented with UMAP for KRAS (top) and TP53 (bottom) experiments.
Figure 16.
Leiden clusters of genes represented with UMAP for KRAS (top) and TP53 (bottom) experiments.
Figure 17.
MA plot with significantly expressed genes in red of G13R variant for KRAS experiment (top) and E286K variant for TP53 experiment (bottom).
Figure 17.
MA plot with significantly expressed genes in red of G13R variant for KRAS experiment (top) and E286K variant for TP53 experiment (bottom).
well as TXNRD1, which serves more as a surrogate marker than a key malignant phenotype determi nant (DeNicola et al,. 2011; Delgobo et al,. 2021). Notably, genes like BCL2L1 and PRDX1, despite being significantly expressed, are not identified by HH, although their relevance in Nrf2 dysregulation according to current literature (Sillars-Hardebol et al,. 2012; Ding et al,. 2017).
Furthermore, oncogenic KRAS induces de novo lipogenesis to promote aberrant growth by activating SREBPs, which are essential in cholesterol synthesis. Significant genes involved in this pathway include SQLE, FDFT1, HMGCS1, IDI1, DHCR7, and FDPS (Ricoult et al,. 2016; De Vusser et al,. 2020), whereas SCD, crucial in growth and metastasis, is not included (Zhao et al,. 2022).
Additionally, type II cytokines such as interleukin-4 and interleukin-13 from Th2 cells in the tumour microenvironment stimulate MYC transcriptional upregulation (included by HH), driving glycolysis (Dey et al,. 2020). Conversely, interleukin-12 triggers JAK/STAT activation, essential for cell survival, proliferation, angiogenesis, invasion, and metastasis, leading to the downregulation of ANXA2, a pro-metastatic gene, the only significantly expressed gene in this pathway detected by HH (Yehia et al,. 2023).
Figure 18.
Selected annotated pathways of detected modules for KRAS and TP53 experiments.
Figure 18.
Selected annotated pathways of detected modules for KRAS and TP53 experiments.
For the TP53 experiment, mutants of TP53 accumulates in the cytoplasm suppressing ROS production to avoid apoptosis through the respiratory complex, in which NDUFA9, MT-ND6 and MT-ND1 participates and are significantly expressed, all incorporated by HH (Kim et al,. 2014).
Similarly, Insulin-like Growth Factor mRNA Binding Proteins are commonly annotated, with CD44 being well-captured by HH implementation. CD44 plays roles in cytoskeletal organization, cell adhesion, migration, and is associated with cancer metastasis influenced by oncogenic factors like TP53 (Huang et al,. 2018; Bell et al,. 2013). However, IGFBP4 and IGFBP6, though significantly expressed and implicated in tumour suppression, are not included by HH, relieving their not-central role in the subnetwork (Gligorijević et al,. 2022).
Alternative RNA splicing, essential for proteomic diversity, often goes aberrant in cancer. The proto-oncogene SRSF3 (SRp20) is significantly expressed and included by HH, although its exact oncogenic splicing targets remain unclear (Ajiro et al,. 2016).
Lastly, dysregulated translation of mRNAs involved in G1 cell cycle progression and S-phase initiation enhances cell size and proliferation (Miyake et al,. 2012). HH detects significant expression of MRPL10 and MRPL13 in the variants cluster, known to be overexpressed and to promote invasion in cancer (Zeng et al,. 2021).
4. Conclusion
This project introduces a comprehensive pipeline that integrates various bioinformatics tools and methodologies derived from extensive bibliographic research. It leverages high-throughput sequencing to analyse differential gene expression and complex altered PPi subnetworks in cancer variants. The pipeline demonstrates excellent performance, successfully distinguishes impactful, identifies variants with similar expression profiles, and shows promising results for retrieving key affected modules already observed in KRAS and TP53 variants. These promising results indicate the potential of the Perturb-seq approach to become a standard strategy for uncovering not only common significantly expressed genes, but also interactors that could be targeted in future therapeutic strategies. All code to recapitulate the analysis is in
https://github.com/jesusch10/perturbseq_analysis along with documentation of pipeline usage.
In contrast, several widely used PPi network databases have been developed, including BioGRID (Oughtred et al., 2019); iRefIndex (Razick et al., 2008): ReactomeFI (Vastrik et al., 2007); and STRING (Szklarczyk et al., 2019). Each database has a distinct topological structure, crucial for identifying candidate driver genes and module structures. However, most existing network-based methods rely on input from a single PPi network, highlighting the need for a method capable of identifying significantly perturbed subnetworks in cancer cells by leveraging multiple PPi networks (Yang et al., 2023).
One way to address this issue is by integrating multiple PPi networks into a single network. Although the HH algorithm applies a consensus approach that combines the topological information of multiple PPi networks, it first identifies subnetworks in individual networks without using information from other networks. This might result in missing structures that could be revealed by considering all PPi networks together (Yang et al., 2023; Reyna et al., 2018). Moreover, HH results depend on the clustering parameter δ, determined by the number of permutations. Despite reducing the number of permutations to 10 to avoid computationally intensive calculations, consistent results are obtained (Ahmed et al., 2020).
Additionally, consensus procedures often assume that every input PPi network has a complete topology covering most interactions (Luck et al., 2020). However, topological information is frequently incomplete, so the integration has been described to produce subnetworks significantly different from the ones obtained with single networks (Swaney et al., 2021). Furthermore, different databases employ diverse procedures, which can describe the same biological reactions quite differently. A more reasonable but labour-intensive approach is manually merging the knowledge from different databases by data curation experts to reduce the fragmentation of knowledge stored in these resources (Wu et al., 2010).
This pipeline effectively reduces experimental complexity and batch effects and decomposes noisy, high-dimensional single-cell data into more interpretable components, enabling the decoupling of common responses to a given group of similar perturbation. By changing the number of modified cells and sequencing depth, next step involves isolating these responses from confounding effects, such as the distribution of variants across cell cycle states, previously assessed as the primary difference between TP53 variants but not KRAS variants (Dixit et al., 2016; Adamson et al., 2016; Ursu et al., 2022). Although this future approach does not determine the mechanism of their impact, it provides hypotheses for them. Perturb-seq can in principle dissect higher order effects, but systematic analysis of genetic interactions remains an ambitious goal to further create an available atlas of variant impact across all key cancer genes and contexts (Ursu et al., 2022). However, in the future it should be considered that there is the possibility by which surface mutations (overall missense mutations under positive selection) could generate a new edge in the network thereby remodelling network architectures (Ozturk et al., 2022).
Despite Perturb-seq significantly reducing time and cost, these still scale linearly with the number of perturbations assayed, whereas the combination space grows exponentially as the order of combinations increases. Nevertheless, Perturb-seq is expected to scale as sequencing costs decrease, making it feasible for targeted screens of a large subset of genes (Dixit et al., 2016). Actually, new approaches for carrying out Perturb-seq involve generating composite samples, either by overloading microfluidics chips to generate droplets containing multiple cells (cell-pooling) or infecting cells with multiple gRNA. These methods lead to substantial cost reductions, maintaining accuracy, increasing the power to detect genetic interaction effects, reducing the number of cells needed for screening, and testing selected combinations (Jaitin et al., 2016; Yao et al., 2023; Wessels et al., 2023). Additionally, costs could be further decreased by advances in long-read scRNA-seq without variant barcoding or using other scRNA-seq methods such as scifi (Lebrigand et al., 2023; Datlinger et al., 2021).