Introduction
The combination of active genes and the interactions between them within a cell generates a gene regulatory network (GRN) that enables a cell to specialize in fulfilling its function. Therefore, an improved understanding of GRNs paves the way to use such information not only as a means to understand the principles of gene regulation but also as a tool to drive cell fate for purposes such as cellular engineering, or to prevent disease outcome [1]. There are several approaches to infer GRNs from omics data [1,2,3]. These approaches rely on a combination of biological and statistical information to identify pairs of related genes for the construction of GRNs. Approaches such as co-expression network analysis aim to build a network by identifying genes whose expression profiles are correlated. While these approaches systematically examine the genome and as such are hypothesize-free, the retrieved network is undirected due to the symmetrical nature of correlations. Other methods try to resolve this issue by including biological information. Notably, they use data from CHIP-seq and cis-regulatory elements to identify transcription factors that bind to target genes and as such turn the undirected network into a network that its edges indicate causality. The gene regulatory network obtained from such approaches does not cover the interactions among gene products, (e.g., functional interaction at the level of pathways). Furthermore, they assign transcription factors to their target genes based on genomic proximity which could introduce bias in the presence of distal regulatory effects [1,2,3]. Another factor that limits the practicability of current approaches is that they require access to individual level data which is not always feasible for reasons such as participants privacy or logistical considerations. In this study, I describe a method based on eQTLs that can address the aforementioned issues.
An eQTL or an expression quantitative trait locus is a site on DNA that variation in its sequence impacts the expression of a gene. If an eQTL is located near the gene it acts upon, it is referred to as a cis-eQTL; however, if it is located distant from its gene of origin, sometimes on a different chromosome, it is referred to as a trans-eQTL. Over the past decades, high throughput studies have been developed to map the eQTLs. Results from these studies which basically summarize the magnitudes and the natures of associations between genomic variants (eQTLs) and genes then are considered collectively to investigate the genetics of transcriptome. An insight from these studies is the evidence that trans-eQTLs tend to be tissue specific [4,5,6]. A new development in this field is the advent of statistical methods that can leverage publicly available GWAS summary statistics including eQTLs to investigate the nature of relation between two biological entities (e.g., two genes) [7,8]. A prominent method in this regard is Mendelian randomization (MR) that can not only test the association between two genes, but also differentiate between causation and correlation [9,10]; moreover, because MR uses a set of independent SNPs for association testing, the results are immune to the bias that may be introduced by the environmental (non-genetic) factors. Building upon these notions, in this study, I devised a workflow that can complement the previous approaches for constructing tissue-specific GRNs.
Methods
Data Sources
Previously, the eQTLGen consortium [11] has investigated the genetic architecture of blood gene expression. The data in the eQTLGen Consortium consist of 31,684 blood and PBMC samples (19.6% of samples) obtained mainly from studies conducted in European populations. Gene expression profiling of samples were carried out using expression arrays and RNA-seq (20.3% of samples). By conducting a meta-analysis, the authors combined the eQTL summary association statistics from 37 studies [11]. The outcome of the meta-analysis revealed cis-eQTLs for 16,987 genes; furthermore, by selecting 10,317 trait-associated variants, the authors identified trans-eQTL for 6,298 genes. Given that 95% of samples in the eQTLGen database are obtained from subjects of European origin. I used the genotype data from the 1000 Genomes-European population to compute the degree of linkage diseqlubrium (LD) among SNPs. Similarly, to examine the association between gene expression data and phenome, GWAS summary statistics for traits was obtained from studies conducting using samples from the European population.
Mendelian Randomization
I used the procedure outlined in
Figure 1 to identify gene(s) through which a trans-eQTL exerts its impact. Initially, trans-eQTLs that mapped to the HLA region (coordinates: chr6:28,477,797-33,448,354, based on GRCh37) were removed. This is due to the complex linkage disequilibrium structure of the HLA region and the bias it may introduce in the results. Then trans-eQTLs and cis-eQTLs were matched to find eQTLs that show both cis and trans regulatory effects. Through this procedure, I identified 55,582 gene pairs that shared at least an eQTL (P<5e-8). Next, I used Mendelian randomization to investigate if change in the expression of the source gene (gene that was linked to the cis-effect of an eQTL) impacts the expression of the target gene (gene that was linked to the trans-effect of the eQTL). Mendelian randomization is a statistical test that can investigate the nature of relation between two biological entities by comparing their magnitudes of association to several genetic markers. The test uses a set of independent (in linkage equilibrium) SNPs that are associated with the predictor (e.g., the expression of the source gene) to investigate the impact of change in the level of the predictor on the outcome. For the purpose of this study, I used the GSMR algorithm to conduct the MR analysis [12]. By comparing the files containing eQTL summary statistics for the source gene and the target gene, the algorithm finds a set of shared SNPs that are in linkage equilibrium (r
2<0.05) and significantly associated (P<5e-8) with the expression of the source gene to conclude whether change in the expression of the source gene impacts the expression of the target gene. Of note, given that a significant association between the two genes could indicate correlation. Namely, the situation that the expressions of both genes are controlled by a transcription factor. SNPs with pleiotropic effect (i.e., source gene ← SNP → target gene) were identified and removed from the analysis using the pleiotropy test implemented in the GSMR algorithm. The test which is known as heterogeneity in dependent instruments (HEIDI) can identify a pleiotropic eQTL by comparing the association of its adjacent SNPs with expression of both genes. In this context, if a heterogeneity is not observed in the pattern of associations, it indicates a pleiotropic effect. As compared to other methods, GSMR algorithm accounts for the sampling variance in β (beta) estimates and the linkage disequilibrium (LD) among SNPs, as such it provides more statistical power. GSMR also incorporates a variety of quality assurance and helpful functions, notably aligning both GWAS summary datasets to the same reference allele at each SNP and a clumping function to only keep non-correlated (r2<0.05) significant SNPs (with association P-value < 5e-8) in the instrument.
Following MR analysis, gene pairs (N=2,350) that showed significant evidence of association (P<5e-8) were selected and plotted using the Cytoscape software (version 3.10.1 ) [13] in order to find if they form a network. DAVID functional tool (version 2023q4) [14] was used to identify biological processes that are overrepresented among the identified genes.
Conditional Analysis
To investigate if the association between a gene and a trait is mediated through a blood trait conditional analysis was conducted. For this purpose, I adjusted the entire GWAS summary statistics of the trait for the covariate of interest using the mtCOJO algorithm [10]. Next, the association between the gene expression and the trait was re-calculated and compared with the results from the initial model (unadjusted for the covariate). mtCOJO uses a number of parameters, including genetic correlation, heritability and causal effect, to produce the adjusted data. It requires only GWAS summary statistics and is known to be free of bias due to shared environmental or genetic effects [10].
eQTL Pruning
To understand the properties of eQTLs underlying a gene for downstream analyses. I generated a list of independent (r2<0.05) eQTLs per gene using the clump algorithm implemented in PLINK (v.1.9) [15]. In summary, the algorithm takes a list of eQTLs and their P-values, conducts LD pruning, and returns a list of eQTLs in linkage equilibrium and prioritized by P-values.
Following the LD pruning, the phenotypic variance (
, proportion of variation in a gene expression) attributed to an eQTL was calculated, as previously described[16] using the equation:
where
is the frequency of effect allele and
is its regression coefficient derived from the association model. eQTLGen consortium reported Z-scores instead of regression coefficients. As such, a conversion was made using the equation (Zhu et al., 2016):
Discussion
Gene regulatory network (GRN) enables a cell to specialize in carrying out its function. There are various cell types in the body and identification of their GRNs are important for various biological purposes including to better diagnose and treat diseases. There are currently several approaches that can investigate such networks by analyzing the raw data available at individual levels. This hinders the possibility of collaboration among researchers in order to combine their data and to map gene-regulatory networks with higher statistical power. Furthermore, as indicated in the introduction, the existing methods make a number of assumptions for generating GRNs; however, such assumptions could introduce bias in presence of alternative scenarios. The method proposed in this study provides several benefits. First, it can scan the genome systematically (hypothesis-free) and identify genes whose transcripts are related. Second, it uses SNPs (eQTLs) to test the association between two gene transcripts as such, it is undisturbed by the impact of confounding environmental factors, because the distribution of alleles at conception is a random process (Mendel’s law of segregation). Third, it can differentiate between causation and correlation because it relies on non-pleiotripic, independent SNPs to investigate the nature of association between two transcripts. Finally, it relies on summary association statistics that are publicly available as such, it provides a convenient path for researchers who wish to share their data or combine findings from several studies to identify GRNs with higher statistical power and without privacy concerns.
By applying the devised workflow to eQTL data for blood, I detected gene-pairs that aggregated into a gene-regulatory network. Examining the function of the network revealed it is associated with hemo-immune processes which is expected considering the examined eQTL data are from blood. The topology of the network resembled the properties of a scale free network [
27]. A core of 8 genes accounted for 20% of the interactions and the distribution of interactions per gene followed a power law distribution (
Figure S1). If this happens to be the case in other cell types then this provides a convenient path for therapeutic interventions, because a scale free network is manageable by targeting its hub genes.
In this study, I noted that a gene in the network is under the regulatory impact of about 18 eQTLs as such, mendelian randomization is a well-suited tool tend to detect functional interactions, however, to fairly examine the association between two genes, access to full GWAS summary statistics data is required. This is important considering that two genes could be on different chromosomes and under the influence of different trans-regulatory elements. Access to full GWAS summary statistics is also important to compute gene expression heritability estimates. In this study, I noted a positive correlation between higher values of gene expression heritability and likelihood of a gene exerting regulatory impact on the network. If this is proven to be true in other studies, such as twin or population studies that have access to genotype data. Then gene expression heritability can be used to identify genes in a tissue that have high regulatory impact.
By examining the association of the network’s hub genes with the phenome. I found GSDMB and ORMDL3 impact several diseases of immune origin. Higher expression of these genes contributed to higher risks of allergic disease, asthma, atrial fibrillation but lower risks of primary biliary cholangitis, inflammatory bowel disease, rheumatoid arthritis and lower levels of HDL. This finding indicates therapeutic approaches that aim to change the expression of these genes, should find an optimal threshold that balances the antagonistic peliotropic effect of these genes. Otherwise, side effects are expected.
ORMDL3 is localized to the endoplasmic reticulum and regulates downstream pathways including sphingolipids, metalloproteases, airway remodeling, and chemokines.
GSDMB encodes gasdermin B, a member of gasdermin domain containing proteins which are involved in different processes associated with cellular state, such as cell cycle control, differentiation, and apoptosis [
28] . The analyses revealed higher expression of these genes have concordant effects on several disorders of immune origin. Therefore, they may act through the same molecular pathway to impact diseases of immune origin. In this regard, inflammation is notable because
ORMDL3 is involved in the development of the unfolded protein response, which triggers an inflammatory response including the formation of gasdermin pores on cellular wall and pyroptosis.
I also found an association between higher expression of
HHEX gene, with lower level of triglycerides and lower risk of T2D diabetes. Previous studies have documented the role of
HHEX gene in metabolism.
HHEX encodes a transcription factor that is required for the normal development of organs such as pancreas, gall bladder, bile duct and liver. Within adult endocrine pancreas,
HHEX is selectively expressed in somatostatin-secreting delta cells, which are important for maintaining islet function [
29].
HHEX was initially discovered for its role in hematopoiesis; however, I found adjusting for the impact of blood traits, does not change the observed association with triglycerides level and the risk of type 2 diabetes indicating the association of this gene with metabolic traits is independent of its function in hematopoiesis.
This study is not without limitation, although eQTLGen [
11], consortium has a descent sample size and as such well-powered for detecting trans-eQTLs; however, the authors provided summary association statistics for only 10,317 trans-eQTLs that previously showed association with the phenome. Of note, this could introduce a selection bias if trans-eQTLs are selected with regard to a specific category of traits; however, this was not the case in the eQTLGen study [
11], namely, the selection of SNPs for trans-eQTL mapping was not with regard to a specific category of traits (
Table S5).
In summary, this study describes a workflow to identify tissue-specific gene-regulatory networks. It indicates eQTLs can provide insight into cellular processes, if such data become available for various cell types through statistically well-powered studies. The identified network showed scale-free topology, if further researches substantiate this finding, then in each network, targeting the hub genes could provide a solution to treat disease outcomes.