A. Biological Processes that Lead to Gene-level Diversity
As gene expression analyses have become a critical tool for furthering phenotypic, mechanistic, and evolutionary interpretation, it is vital to understand the forces guiding gene expression heterogeneity [
33]. As with other biological processes, inherent noise in gene expression has been observed ubiquitously across species [
34]. This stochasticity is driven by many processes within the cell, including transcription/translation initiation and mRNA/protein degradation [
34]. In extreme cases, this stochastic gene expression noise has been shown to reduce fitness in yeast cells [
35]. However, previous studies have noted that genetic and environmental factors are the two main drivers of biological heterogeneity in gene expression [
36]. Additional intrinsic factors like cell cycle [
37], circadian rhythm [
38], and aging [
39,
40] (which are also influenced by genetic and environmental factors) also contribute to gene expression heterogeneity.
Promoters, enhancers, and transcription factors are key genetic features contributing to gene expression heterogeneity observed across species, tissues, and cell types [
41,
42]. The heavily studied RNA polymerase II core promoter directly regulates gene expression [
43,
44], and natural variations in promoter regions are linked directly to both gene expression and phenotype heterogeneity [
45]. By regulating transcription levels distally, enhancers also influence gene expression heterogeneity within specific cell types, tissues, and even species [
46,
47]. Similar to promoters, alteration in an enhancer region can lead to phenotypic changes by impacting gene expression [
48]. Transcription factors are essential regulatory proteins that drive gene expression by interacting with DNA sequences like promoters and enhancers to control transcriptional processes [
49,
50]. Studies like the Encyclopedia of DNA Elements (ENCODE) project, which integrated over 450 experiments of 119 transcription factors, have demonstrated that transcription factors have dynamic regulatory networks that lead to measurable heterogeneity in homeostatic gene expression [
51].
Additionally, epigenetic processes, including DNA methylation and histone modifications, as well as environmental or stress responses, can drive gene expression heterogeneity. DNA methylation, notably mammalian 5-methylcytosine (m5C; a methyl group at the 5-carbon of the cytosine in a CpG dinucleotide) [
52], regulates gene expression in multiple ways [
53], including by modulating transcription factor binding, by changing the functionality of enhancers, insulator elements, and promoters, and by altering chromatin conformation [
54]. Various studies have noted correlations between gene expression and DNA methylation, further supporting its role as a possible driver for gene expression heterogeneity. Post-translational histone modifications (e.g., acetylation, methylation, phosphorylation, or ubiquitination) are also known to be correlated with gene expression [
55,
56,
57] and can even be used to predict gene expression [
56]. Environmental and stress-related effects, like hypoxia, can also impact the heterogeneity of gene expression. Many organismal studies have observed that stress elicits a biological response and the subsequent regulation of various genes to alleviate environmental damage (e.g., in oxidative stress) [
58,
59,
60].
B. Methods for Quantifying Gene-level Transcriptome Diversity
Researchers have applied different approaches to empirically calculate gene expression heterogeneity for both bulk and single-cell/nuclei transcriptome profiles, including coefficient of variation (CV) [
23], variance [
61], and others [
62]. While the terms variation and diversity both describe changes in gene expression across samples, variation and variability are more frequently associated with measures of dispersion (e.g., CV, variance), whereas diversity is more commonly associated with probability-based measures, particularly Shannon (information) entropy (
Figure 2A). In fact, the application of CV and variance to gene expression profiling analysis is sometimes known as expression variance (EV) [
62]. Originally described by Alemu et al., EV showed tissue-specific variation across gene expression profiles [
63] and was later used to show expression variation associated with aging and methylation [
64].
Standard deviation describes the dispersion of the data in relation to its mean. Building on standard deviation, CV considers the standard deviation of the gene expression sample divided by its mean and thus is a standardized measure [
23]. Therefore, CV can be compared across conditions or datasets and can identify disease-associated genes that DE alone does not detect [
23]. Additionally, other studies have applied both technical CV and biological CV (BCV) to describe RNA-Seq gene expression variation associated with technical or biological variables, respectively [
21], as well as normalized CV to examine gene expression variation, for example, across neurological diseases [
61]. Recent studies have also used CV to understand how gene expression variability among therapeutic targets determines drug effectiveness and safety, thus improving therapeutic development methodologies [
65]. Another empirical measure of gene expression variation is variance. In the Mar et al. study, variance measures the significance of the mean difference between groups by using a t-test or ANOVA [
61], but the term has also been used synonymously with gene expression variability [
61,
64,
66,
67]. For example, Bachtiary et al. applied variance (here defined as standard deviation squared) to measure the variation of expression between and within cervical cancer patient samples [
68]. Gene expression variance has also been studied in human populations, where functional connections between low-variance genes and fundamental cell processes and high-variance genes with immune processes suggest that variance is biologically meaningful and not merely reflective of stochastic noise [
69].
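As a minimal sketch of the two dispersion measures described above, the following Python example computes the sample variance and CV for each gene. The gene names and expression values are hypothetical, and the function name is an illustrative assumption rather than part of any cited package.

```python
import statistics


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical normalized expression values for two genes across five samples.
# gene_b is gene_a scaled by 10, so only its absolute scale differs.
expression = {
    "gene_a": [10.0, 12.0, 11.0, 9.0, 13.0],
    "gene_b": [100.0, 120.0, 110.0, 90.0, 130.0],
}

variances = {gene: statistics.variance(v) for gene, v in expression.items()}
cvs = {gene: coefficient_of_variation(v) for gene, v in expression.items()}

for gene in expression:
    print(f"{gene}: variance={variances[gene]:.2f}, CV={cvs[gene]:.3f}")
```

In this toy example the variance of gene_b is 100 times that of gene_a, while the two CVs are equal, which illustrates why CV, as a standardized measure, is easier to compare across genes or datasets with different expression scales.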
Though CV and variance are some of the most common methods for empirically calculating variation, there are a few other ways of describing variation across gene expression. For example, differential variability analysis can also be performed with Bartlett's, Levene's, median absolute deviation (MAD), or Fligner–Killeen tests, yet the R package MDSeq, which is based on a reparameterization of the real-valued negative binomial, was shown to outperform these methods [
19]. On the other hand, the range of gene expression observed is one of the simplest measures of variability. Though generally not used in its simplest form (i.e., maximum value minus minimum value), a modified version of range has been used. For example, dynamic range, the log10 ratio between the maximum and minimum normalized gene expression counts, has been used to compare the expression of orthologous genes between humans and mice to determine genes constrained throughout early vertebrate evolution [
70] as well as to describe gene expression variation patterns across organs and tissues [
71]. Additionally, researchers have developed a metric based on a ratio of the percentage of reads covering a proportion of the genome to quantify gene expression variation [
72]. When a large percentage of reads covers a smaller number of total genes in the genome, it indicates lower variability in that condition than when the percentage of reads spans over a larger set of genes in another condition. However, these metrics are biased towards longer genes if gene size is not properly accounted for during analysis.
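The dynamic range described above (the log10 ratio between the maximum and minimum normalized expression counts) can be sketched as follows. The optional pseudocount is an assumption added here to handle zero counts and is not taken from the cited studies; the example values are hypothetical.

```python
import math


def dynamic_range(values, pseudocount=0.0):
    """Return log10(max / min) of normalized expression values.

    `pseudocount` is a hypothetical convenience parameter that is added to
    both extremes so that zero counts do not cause a division by zero.
    """
    lowest = min(values) + pseudocount
    highest = max(values) + pseudocount
    if lowest <= 0:
        raise ValueError("minimum must be positive; consider a pseudocount")
    return math.log10(highest / lowest)


# Hypothetical normalized counts for one gene across four tissues.
counts = [1.0, 10.0, 250.0, 1000.0]
simple_range = max(counts) - min(counts)  # maximum minus minimum
log_range = dynamic_range(counts)         # orders of magnitude spanned

print(f"range={simple_range:.1f}, dynamic range={log_range:.2f}")
```

Because the dynamic range is expressed in orders of magnitude, it is less dominated by the absolute expression level of a gene than the simple maximum-minus-minimum range.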
In 1948, Shannon defined entropy as a measure of the uncertainty of an outcome, or the amount of choice involved in selecting that outcome [
73]. Shannon entropy, the basis of information theory, is the expected value of the negative log of the event probability, so that an event with full certainty (a probability of one) carries no surprise. Over the years, Shannon entropy has been applied to numerous biological processes, including gene expression [
74]. When using Shannon entropy in this context, gene expression measurements for a specific gene are the information used to measure uncertainty, or as we describe it, diversity [
75] (
Figure 2). Previous studies have employed Shannon entropy to study diversity in drug targets [
76], tissue-specificity [
77], species-specificity [
75], and even intraspecies genomic DNA information [
78]. When used to compare gene expression in RNA-Seq data, differential Shannon entropy, compared to differential CV and DE, identified genes overlapping with CV-identified genes but also included unique disease-associated genes [
23], underlining that Shannon entropy can identify biological signals that CV and DE do not. Shannon entropy has also been used in combination with WGCNA analyses by calculating entropy from the betweenness of networks [
79]. Additionally, studies using adaptations of Shannon entropy, such as Tsallis entropy (also known as HCDT entropy), have divided gene-level diversity into two categories: alpha and beta diversity [
80], where alpha diversity represents the diversity of a single profile, and beta diversity represents diversity between samples within a group. This particular example of Tsallis entropy allows a researcher to manipulate a parameter (q) that adjusts the weight of highly expressed genes [
80], therefore giving a higher degree of control and leaving room for interpreting more biologically relevant information at different levels of q. The introduction of alpha and beta diversity nomenclature is an elegant way to describe the diversity shown in
Figure 1A, with alpha diversity representing diversity across genes or transcripts within a sample and beta diversity being the two-dimensional diversity across all samples in a group or population, though this nomenclature is not yet widely used. Example analytical packages that apply entropy and variation in the context of gene expression diversity are described in
Table 1, although many of these analyses are performed without specialized software.
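To make the entropy-based measures above concrete, the following Python sketch computes Shannon entropy and Tsallis entropy for a single expression profile. It assumes that expression values are converted to probabilities by dividing each value by the profile total; the cited studies may derive probabilities differently, and the function names here are illustrative rather than taken from any package in Table 1.

```python
import math


def to_proportions(values):
    """Convert non-negative expression values to proportions summing to one."""
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("at least one value must be positive")
    # Zero entries are dropped because p * log(p) tends to 0 as p tends to 0.
    return [v / total for v in values if v > 0]


def shannon_entropy(values, base=2.0):
    """Shannon entropy: H = -sum(p * log(p)) over the profile proportions."""
    return -sum(p * math.log(p, base) for p in to_proportions(values))


def tsallis_entropy(values, q):
    """Tsallis entropy: S_q = (1 - sum(p ** q)) / (q - 1).

    As q approaches 1, S_q converges to Shannon entropy in natural-log units.
    """
    props = to_proportions(values)
    if math.isclose(q, 1.0):
        return -sum(p * math.log(p) for p in props)
    return (1.0 - sum(p ** q for p in props)) / (q - 1.0)


# Hypothetical profiles: evenly spread expression versus a single dominant value.
even_profile = [5.0, 5.0, 5.0, 5.0]
skewed_profile = [17.0, 1.0, 1.0, 1.0]

for name, profile in [("even", even_profile), ("skewed", skewed_profile)]:
    h = shannon_entropy(profile)
    s2 = tsallis_entropy(profile, q=2.0)
    print(f"{name}: Shannon={h:.3f} bits, Tsallis(q=2)={s2:.3f}")
```

In this sketch, the evenly spread profile yields the maximum entropy for four values, and the profile dominated by one value yields a lower entropy. Larger values of q place more weight on the most abundant entries, which is the adjustable behavior described above.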
Altogether, these gene expression studies demonstrate the importance of both understanding the drivers of gene expression diversity and developing new and comprehensive ways to quantify it. Quantifying gene-level transcriptome diversity is a salient part of ascertaining how biological processes lead to phenotypic manifestations, including in a disease context. It is also imperative to examine other sources of diversity, such as heterogeneity in mRNA transcripts due to AS.