2. Terminology in Allopolyploid Gene Expression
Recent studies on gene expression in allopolyploids have revealed confusion regarding the terminology used. Clear definitions are essential to avoid misunderstandings and ensure effective scientific communication. This confusion often stems from variations in analytical methods and differences in the target organisms studied.
The term “genome dominance”, “genomic expression dominance” or “expression dominance” in allopolyploids was originally described by [
19]. They used these terms to describe situations where the total expression level of a gene or group of homoeolog genes equals that of one of the parental genomes. Their study focused on a synthetic allopolyploid (A2D1), obtained from the inter-specific cross between the two diploid species
Gossypium arboreum (genome A2) and
Gossypium thurberi (genome D1) followed by chromosomal doubling with colchicine. In this synthetic allopolyploid, they observed that the expression levels of approximately 11,000 genes, both up- and down-regulated, were statistically equivalent to those of the D1 genome. Further studies on genome dominance in allopolyploid cotton were conducted by [
20,
21] applied the same concept to Coffea allopolyploid. Similarly, the term “parental dominance” has been used to describe a condition where the overall gene expression of the allopolyploid mirrors the overall expression level of one parent [
22,
23,
24].
In other studies, the term “genome dominance” or “sub-genome dominance” is used to describe certain phenomena, e.g., lower DNA methylation, higher gene retention and preferential expression, attributable to one of the parental genomes of the allopolyploid [
25,
26,
27,
28,
29]. Recently [
30] defined “genomic dominance” in allopolyploids more generally as the situation in which one parental genome becomes dominant over the other. [
31] suggested replacing the term “genome dominance” as defined by [
19] with “expression level dominance”, to avoid the ambiguity of the word “genome”. Furthermore, they pointed out that “expression level dominance” cannot be detected without measuring the expression level in the two parental genomes. Therefore, it is difficult to calculate this value in paleopolyploids because the parents may have become extinct or have undergone substantial genomic and/or transcriptomic changes [
32]. This concept was also highlighted in a review by [
33]. The “expression level dominance” is independent of the relative contribution to expression level of the two homoeologs. In other words, an allopolyploid derived from two parental species (A and B), which differ in the expression level of gene X (10 and 20, respectively), might express gene X (as the overall expression of the two homoeologs) at the same level as one of the parents (i.e., 10 or 20). This expression level is independent of the relative contribution of the two homoeologs in the allopolyploid, as they could contribute equally or unequally to the total expression of gene X. Currently, several studies have used the term “expression level dominance” as a synonym for the term “genome expression dominance” sensu [
19,
29,
30,
34,
35,
36,
37,
38]. “Expression level dominance” could be balanced (the same number of duplicated gene pairs in the allopolyploid have the overall expression level of one of the parental species, e.g., considering 10 duplicated gene pairs in the allopolyploid, the total expression of 5 homoeolog gene pairs is equivalent to one of the parents while the total expression of the remaining 5 homoeolog gene pairs is equivalent to the other parent) or unbalanced (more homoeolog genes than half have the expression level of one of the parental species). According to [
31], the term “homoeolog expression bias” can be used to describe the condition in which a duplicated gene in an allopolyploid is preferentially expressed by one homoeolog, generally resulting in an unequal expression level between the homoeologs (
Figure 1). As observed for expression level dominance, homoeolog expression bias could be balanced (there is an equal number of genes preferentially expressed by each homoeolog, e.g., considering 10 duplicated genes, 5 are preferentially expressed by one homoeolog and the remaining 5 are preferentially expressed by the other) or unbalanced (the number of expressed genes favor one homoeolog over other). In other words, if the expression level of a gene is equivalent between the two diploid parents but becomes unequal between the homoeologs in an allopolyploid, then the gene exhibits an expression bias. Similarly, this bias is also observed if the diploid parents display non-equivalent expression levels while the homoeologs in the allopolyploid show equivalent expression levels [
39]. This last case is more described as “no bias in progeny” [
35,
36,
38], whereby, an expression bias in the parent is reverted to non-differential expression between homoeologs. It can be deduced that to define an expression bias between homoeologs, as in the case of “expression level dominance”, it is necessary to also take into consideration the expression levels of the parental genomes. Therefore, three conditions generally occur (synthetically originated, naturally originated, missing parents).
In the “synthetically originated” condition, the allopolyploid has been synthetically originated and the expression levels of its parents are available [
35]. In this case, there is great interest in studying the phenomena that are generally defined as ‘genome shock’ and ‘transcriptome shock’ [
40,
41,
42]. That is, all those phenomena of rapid reorganization of genome and gene expression that occur after hybridization and duplication.
In the “naturally originated” condition, the allopolyploid originating from the natural hybridization of its two parent species and expression levels of its parents are available [
38]. In this case, it is particularly interesting to observe which genetic, genomic, and epigenetic changes have occurred since the hybridization and polyploidization event over a long period. Nevertheless, the great challenge is represented by the identification of the right parentals (i.e., the expression profile of the parental species that generated the allopolyploid). Considering that not only the allopolyploid but also its parents, assuming they still exist, have undergone evolutionary mechanisms, we cannot be certain that the identified parents are truly representative of the original parents, especially due to polyploidization events that occurred several million years ago [
32].
In the two conditions described above, if the expression levels of the parents are not equal, the expression bias of the homoeologs has been evaluated by comparing the expression value of each homoeolog with a parental mix expression value [
20] or more frequently with the parental expression level estimated separately [
35,
38].
In “missing parents” condition, the parental expression levels are not available, as long as it is possible to identify the parental genomes in the allopolyploid, and an equal expression level of the two parental genomes is usually assumed [
25,
43]. If we assess homoeolog expression bias without knowing the parental expression levels, we cannot dismiss the possibility that the bias between homoeologs is merely a reflection of parental inheritance. Thus, the expression differences observed between homoeologs might have already been present in the parental species (
Figure 1). In other words, for a given gene or group of genes, the relative contribution of expression level between homoeologs in the allopolyploid is equal to the relative contribution of expression level between parents. This situation could be described as an additive expression of homoeologs [
33,
44]. But generally, the term “additive” describes a situation in which the total expression of both homoeologs in an allopolyploid, for a gene or a group of genes, equals the arithmetic mean, also known as the mid-parent value (MPV) [
20,
24,
34,
35,
38], or as discussed by [
45], the sum of parental species expression (also defined as summed-parent value “SPV”). According to [
45] the term “MPV” is used to calculate additivity by assuming that each parent contributes half of their genome to the offspring. However, in the case of neo-polyploidy, where the offspring receives each parent’s entire genome, parental values should be summed, not averaged, if they are truly additive. This is because the entire genome, rather than half, is passed on to offspring. However, MPV is the generally used method to calculate the additive expression value (
Figure 1). As discussed by [
33], there are at least two ways in which MPV is calculated. Either by measuring gene expression levels in a mixture of RNA from both parents, as demonstrated in studies using microarray approaches [
46,
47,
48], or, more recently, through RNA-sequencing technology, where reads are counted relative to the expression levels of each parent separately and then averaged [
34,
35,
38]. For example, suppose we have two parental species A and B, that differ in the expression level of the X gene (10 and 20, respectively). The MPV of X gene is (10 + 20)/2 = 15. If the allopolyploid derived from the two parental species A and B has an expression of X gene equal to 15, then we can say that the expression of X gene is additive in the allopolyploid. The additive condition of parental expression is generally used as reference to compare the overall allopolyploid expression of a gene or a group of genes, i.e., the zero hypothesis. If overall expression of allopolyploid is not additive, then it might be equally to one of the parents (expression level dominance) or it might be lower (transgressive down-regulation) or higher (transgressive up-regulation) than both parental expressions (
Figure 1). In transgressive expression the gene expression is outside the range of parental expression level. For example, if the expression level of the X gene is 10 and 20 in parental species A and B, respectively, then the expression level of the same gene in allopolyploid is transgressive up-regulated if it is higher than 20, or transgressive down-regulated if it is lower than 10. In practice, differentially expressed genes (e.g., |log2 fold change| ≥ 1 and padj ≤ 0.05) between the allopolyploid (considering the joint expression level of both homoeologs) and its parents are usually binned into 12 conditional states. The first binned could be done dividing genes whose expression is not distinguishable from additivity (e.g., padj ≥ 0.05) and distinguishable from additivity (e.g., padj ≤ 0.05). Genes in the latter group could be further classified into more specific nonadditive states (i.e., expression level dominance and transgressive down/up regulated) [
19,
20,
34,
35,
38]. According to [
19], the 12 differentially expressed states, distinguished by Roman numbers, are: additive (categories I, XII), expression level dominance (categories II, IV, IX, XI) and transgressive regulation (categories III, V, VI, VII, VIII, X) (Figure1). Genes not differentially expressed between the allopolyploid and the parents are assigned to the “no change” state (
Figure 1). It is important to note that conditions mentioned above can be influenced by various factors, including the specific genes involved, the tissues being analyzed, and the environmental conditions [
36].
3. How Allopolyploidy Affects Bioinformatics Transcriptome Analyses?
Analysis to estimate gene expression profiles in allopolyploids is challenging, particularly in the steps of read mapping and normalization of gene expression levels. Intrinsic genomic redundancy complicates the analysis, the complexity of which is further amplified in polyploids due to the existence of multiple sets of chromosomes and homoeological variation coexisting with allelic variation. Allopolyploidization leads to genetic, epigenetic, transcriptomic and proteomic variations, compared to parental species, often associated with the activities of repetitive sequences and transposon elements [
41,
49,
50,
51].
As discussed above, in allopolyploids, the relative expression levels of homoeolog gene pairs can be unequal. While, compared to the parental species, the cumulative expression of each homoeolog pairs can be additive or not. Furthermore, homeologs could be mutually silenced on a tissue-specific basis [
36,
52]. Early research, based on the use of microarrays, often failed to distinguish expression between homoeolog genes [
19,
48]. This limitation meant that scientists could only assess the cumulative expression of a target gene, lacking information on the regulation of individual homoeolog copies, i.e., the relative contributions of each homoeolog copy to the overall expression were elusive. Consequently, when comparing gene expression between allopolyploid species and their parents, scientists could only infer whether the combined expression of homoeolog copies was comparable to MPV or parent’s expression levels.
High-throughput RNA-sequencing and the availability of bioinformatics and genomic resources (such as the OneKP project, an international initiative that has generated large-scale gene sequencing data for over 1,000 phylodiverse plant species [
53], and PloiDB, a database of ploidy estimates for seed plants ([
54]), have significantly improved our ability to define gene expression profiles in allopolyploids and appropriately distinguish gene expression levels of homoeolog genes.
Gene expression levels in both allopolyploids and diploid organisms are typically measured using high-throughput RNA-sequencing (bulk RNA-seq). It targets mRNA and/or non-coding RNA libraries for the identification of differentially expressed genes. It aims to estimate the average of global expression values in each sample, elucidating the molecular mechanisms involved in a particular tissue at a given time point. In these experiments, mRNA is extracted, purified, and converted into cDNA. The cDNA is then fragmented and sequenced using next-generation technologies. The resulting sequenced reads are mapped to a reference genome or transcriptome, and the number of reads mapped to each gene, or other biological units indicates its expression level.
Furthermore, [
55] conducted a study on the allopolyploid
Gossypium hirsutum using single-cell RNA-seq (scRNA-seq) to investigate the mechanism of cotton fiber cell initiation from ovule epidermis. scRNA-seq enables profiling of the entire transcriptome of numerous individual cells isolated using microscopic droplets or wells. Molecular barcodes are attached to sequences before amplification to identify the cell origin of each transcript. The core of scRNA-seq analysis is the expression matrix, detailing the transcript counts for each gene and cell, which undergoes data preprocessing and subsequent downstream analyses. Unlike bulk RNA-seq, scRNA-seq provides more precise insights into the heterogeneity of gene expression across different cell types and the molecular trajectory of cell differentiation during development. Nevertheless, bulk RNA-seq remains the primary method for studying expression in polyploids.
Better identification of homoeologs expression now allows for a deeper understanding of allopolyploidy effects on gene expression and regulation. This includes exploring homoeolog expression bias, dosage compensation, and epigenetic modifications, all of which play a critical role in phenotype variation, environmental responses, and progress in crop improvement. Choosing the best strategy to study the transcriptome of allopolyploids depends on several factors, including the nature of the biological question, the range of bioinformatics tools available, the quantity and divergence of sub-genomes within the allopolyploid, and the available information on the parental species.
Therefore, it is necessary to consider different analytical approaches in order to optimize the analysis.
The following paragraphs aim to review the intricate steps and prerequisites involved in transcriptome analysis of allopolyploids, focusing on RNA-seq. These include sub-genome phasing, homoeology inference, RNA-sequencing, read mapping and normalization in identification of differentially expressed genes.
4. Sub-Genome Phasing
In allopolyploid species multiple parental genomes, called sub-genomes, coexist in a single nucleus. A key step is to assign each homoeolog chromosome and gene copy to the corresponding sub-genome. Notoriously, sub-genome identification is simplified if the genomes of putative progenitors are characterized. Nonetheless, the assignment of homoeologs to the corresponding sub-genomes remains challenging due to divergence between the parental and allopolyploid genomes, the high similarity between sub-genomes, and the reshuffling events that characterize the allopolyploid genome. This reorganization includes phenomena such as the loss of chromosomes, homoeologs, and duplicated genes, known as “fractionation” [
56,
57,
58,
59]. Fractionation, if it affects the entire polyploid genome, contributes to “diploidization.” Homoeolog exchanges represent another aspect of this dynamic process, which involves the mispairing of chromosomes of different sub-genomes [
60,
61,
62,
63]. Because of these exchanges, DNA fragments of varying sizes are swapped between sub-genomes, potentially leading to deletions, duplications, and translocations. Although most recombinations between sub-genomes occur between homoeolog chromosomes, occasionally they also occur between non-homoeolog regions [
62].
The identification of sub-genomes is often necessary as preparatory information for homoeolog annotation and correctly distinguishing them from paralogues and allelic variants. Additionally, this identification enables the analysis of differential expression between sub-genomes, offering insights into their distinct regulatory mechanisms and enhancing our understanding of the biology and evolution of the studied organism.
Sub-genome inference is achieved using supervised and unsupervised bioinformatic approaches. The supervised approach exploits data and sequences from diploid parents or closely related species. When such information is available, sub-genomes can be distinguished by mapping allopolyploid-derived DNA- or RNA-seq data on one or both reference genomes, taking advantage of sequence similarity or homeolog-specific polymorphisms [
64,
65,
66]. To achieve this, specialized mapping and categorization algorithms designed specifically for allopolyploid organisms can be used, including PolyCat [
67], SNiPloid [
64], and HANDS2 [
68], particularly if a single reference genome is accessible. Alternatively, PolyDog [
69] is instead used when both parental genomes are available (more details about mapping and categorization algorithms will be given in the subsequent paragraph). Further evidence to sub-genome assignment can be obtained from multi-locus synthetic phylogenetic trees [
27] and synonymous nucleotide substitution approaches implemented in the WGDI pipeline [
70].
When both the sub-genomes and the parental species genomes have been diverging for a long period of time, the sub-genomes are not easily distinguishable. To overcome this hurdle, the sub-genomes separation can be based on large phylogenetic data collection [
71]. Specifically, by evaluating the phylogenies of all individual genes within the allopolyploid species and their homologs in the parental species, it becomes feasible to rank parental leaves based on distance. This process allows for the assignment of parental origins to each homoeolog gene pairs by selecting the leaf with the shortest phylogenetic distance.
The unsupervised strategy comes into play when diploid ancestors are unknown and the methods rely on differences in sub-genome properties, such as GC content or the presence of sub-genome-specific repetitive sequences (e.g., transposable elements). Various programs, including PolyCRACKER [
72] and SubPhaser [
73], implement this methodology and require chromosome-scale genome assembly. Recently, [
74] provided a valuable benchmark and recommendations for assessing the accuracy of the WGDI and SubPhaser tools. Additionally, a comprehensive review on allopolyploid sub-genome identification was conducted by [
75].
5. Homoeology Inference
The identification of homoeologs in transcriptome analysis is essential for obtaining a detailed and comprehensive view of gene expression. It helps determine whether varying expression levels of homoeologs across different sub-genomes influence the phenotype and reveals whether homoeologs have retained similar functions (sub-functionalization) or acquired new functions (neo-functionalization).
The process of identifying ortholog genes, i.e., genes derived from a single ancestral gene and now present in different species [
76], overlaps in some ways with the identification of homoeologs in allopolyploids [
10]. Indeed, homoeologs within the sub-genomes of an allopolyploid species could be considered as orthologs between distinct species, given their common origin from speciation events. Their identification does not imply that the genes have the same function but rather that they arise from a speciation event [
76]. Nevertheless, in comparative genomics the “ortholog conjecture” states that ortholog genes, and therefore also homeologs, have similar functions [
77]. In comparative genomics, homologs genes resulting from a gene duplication event are called paralogs, and those from recent duplications can maintain similar functions compared to distant paralogs separated by millions of years of evolution [
78]. Paralogs that resulted from a duplication event that occurred before speciation are called outparalogs, while those that occurred from a post-speciation duplication are called inparalogs [
79]. Since inparalogs can be grouped under a single gene and defined as co-orthologs (“orthologs group”), not all orthologs in the same group arise from speciation. This situation could be considered a “terminology muddle,” as defined by [
80], since the term extends the relationship to include inparalogs relative to another gene in a different species [
81]. In the context of a hybridization event, these genes would classify as homoeologs to each other [
82]. According to this classification method, homoeolog relationships could vary from one-to-one, one-to-many, or many-to-many, depending on the number of duplications since the speciation of the progenitors [
10]. It is important to note that in allopolyploids and polyploids in general, the presence of redundant genes as paralogs allows the latter to acquire, relatively quickly, mutations, leading to the emergence of sub- neo- and non-functionalizations [
2]. Outparalogs, having had more time to evolve and potentially change function compared to in-paralogs, are typically not considered co-orthologs [
83]. Similarly, they should not be classified as homoeologs for the same reason.
Programs developed to determine orthologs could be adapted to infer homoeologs in allopolyploid species, given the parallels between the two. This often requires a priori knowledge of sub-genomes in allopolyploids. Bioinformatics inference of homoeologs involves a combination of computational and analytical approaches. The choice of method depends on the specific requirements of the study and available resources. Various programs have been developed for this purpose; these can be mainly divided into graph-based and tree-based approaches. The graph-based method assumes that orthologs or homoeologs share sequence similarities. The graph construction phase identifies orthologs or homoeologs through similarity searches, considering pairs of genomes or sub-genomes at a time. In these graphs, genes (or proteins) are depicted as nodes, and their evolutionary relationships are represented as edges [
84].
The Bidirectional Best Hits (BBH) approach is the simplest and most widely used method based on sequence similarity [
85]. This method assumes that two genes are homoeologs if each gene is the best match of the other when their respective sub-genomes are compared. In simpler terms, for a gene pair to be classified as homoeologs, each gene must show higher similarity (e.g., highest alignment score) to its pair than to all other genes in the other sub-genome [
86]. Similarity is typically assessed using BLAST [
87] or other sequence alignment algorithms. However, the BBH method has a limitation: it can only identify orthology in a 1-to-1 relationship [
88]. To complement this method, synteny (the conservation of gene order between two chromosomal regions [
89]) is often employed as additional evidence for identifying homoeologs [
90]. Even when both methods are used, the ability to detect genes that have duplicated and relocated, both before and after allopolyploidization events, remains a challenge. For instance, in plants where gene duplication rates are high, the BBH method alone may fail to identify up to 60% of orthologous relationships in datasets involving 12 Viridiplantae species [
88]. Even when BBH and synteny methods are combined, they may still fail to detect up to 26% of homoeologous relationships in cotton [
86]. To address this limitation, other algorithms such as OMA [
91] and OrthoFinder [
92] have been developed to incorporate inparalog relationships into the graph construction process. The OMA pipeline has been adapted to infer homoeologs in recent studies [
82]. In these studies, homoeologs were inferred in three polyploid species: upland cotton (
Gossypium hirsutum), rapeseed (
Brassica napus), and bread wheat (
Triticum aestivum). Each sub-genome was treated as a separate genome, and orthologs were inferred using the standard OMA pipeline [
91]. Similarly, [
93] identified homoeologs in a synthetic allotetraploid wheat resulting from a cross between the diploid parent species
T. urartu (AA) and
Ae. tauschii (DD). They inferred orthologs between parental genomes using OrthoFinder.
Tree-based methods rely on the assumption that genes share an evolutionary history and employ a resolution technique known as “tree reconciliation.” In this approach, the structure of a gene tree is compared with that of a species tree, and they are reconciled based on the principle of parsimony, aiming to infer the least number of duplication and loss events during the gene’s evolution [
83]. To begin this method, a multiple alignment of homologous sequences is first generated, which forms the basis for constructing a phylogenetic tree of the gene family. During the reconciliation phase, the nodes of this gene tree are classified as either duplication or speciation events by comparing them with the nodes of the species tree. This method is implemented in various software, including Ensembl Compara [
94]. The Ensembl Compara pipeline utilizes the TreeBeST algorithm (
http://treesoft.sourceforge.net/treebest.shtml), originally developed for TreeFam [
95], to infer reconciled gene trees. This pipeline was utilized by [
96] to infer orthologous genes for studying the evolutionary history of the polyploid family Cyprininae.
The effectiveness of tree-based methods hinges heavily on the accurate construction of multiple alignments and trees, which can be computationally intensive and thus limit their application to large datasets [
78]. A hybrid approach that integrates graph- and tree-based methods has been implemented in HomoloGene. In contrast, other software employs a meta-approach to leverage predictions generated by multiple programs, such as MARIO [
97] and MetaPhOrs 2.0 [
98]. These methods have been extensively reviewed in several studies [
10,
78,
83,
86,
99]. It is important to note that when inferring homoeologs based on RNA-seq-derived sequences, misidentification may occur if one or both homoeolog genes are not expressed.
6. RNA-Sequencing
In recent years, expression profiling in allopolyploid species has commonly involved the analysis of RNA-seq data obtained through next-generation sequencing (NGS) technology, with a notable preference for short-read sequencing. Despite the availability of several NGS technology providers, these short reads are often generated using Illumina instruments due to lower cost, high throughput and availability for a wide range of tools and pipelines also suitable for polyploid species [
100,
101]. For example, through the illumina HiSeq 2000 platform, [
37] studied the expression patterns of homolog genes between the allotetraploid cotton
Gossypium hirsutum and its diploid progenitors
Gossypium arboreum and
Gossypium raimondii at the early fiber development stage. While, through Illumina NextSeq 500, [
102] performed a de novo transcriptome assembly of the tetraploid species
Tripsacum dactyloides, a sister genus of
Zea mais, in order to demonstrate that similar genes may be decaying into pseudogenes in both genera after a shared ancient polyploidy event. Considering that the cost of sequencing a genome is affected by its size, the cost for polyploid organisms could be double that for their diploid counterparts. Recently, [
18] explored the expression of homoeologs in allohexaploid wheat and demonstrated cost-effective sequencing using 3′ UTR RNA-seq with the Lasy-Seq protocol [
103]. However, using short reads poses challenges in transcriptome assembly and isoform identification, especially in allopolyploids where homoeolog genes often produce highly similar isoforms [
104]. These challenges are typically addressed by leveraging RNA-seq long reads. Long-read sequencing technologies such as PacBio Iso-Seq [
105] and Nanopore-Based Direct RNA sequencing [
106] offer improvements in
de novo assembly, mapping accuracy, identification of transcript isoforms, and detection of structural variants. Moreover, direct RNA sequencing from native molecules eliminates amplification biases while preserving base modifications [
107]. [
104] used long-read sequencing to identify approximately 175,000 isoforms in nearly 45,000 genes in allotetraploid cotton, revealing that approximately 50% of homoeolog genes produce divergent isoforms in each sub-genome. Consistent with these findings, [
108] subsequently identified numerous novel transcripts in response to salt stress. Additionally, [
109] utilizing PacBio Iso-Seq, performed a comprehensive transcriptome analysis of the allopolyploid
Brassica napus, leading to the construction of an isoform sequencing database.
7. Mapping Reads
Transcriptome analysis in allopolyploids can produce two types of valuable data. First, uniquely mapped reads can reveal the expression levels of homoeologs, provided there are sufficient polymorphisms between homoeolog sequences. Second, ambiguously mapped reads can indicate the absolute expression abundance for each pair of homoeologs. The mapping process can use the parental genomes as references, treating homoeologs as orthologs [
12], or it can be based on the assembled genome of the allopolyploid itself.
Assembling an allopolyploid genome presents significant challenges, exacerbating the difficulties already encountered in diploid genome assembly. One major issue is the incorrect assembly of fragments from one sub-genome into another, leading to chimeric and fragmented contigs. This challenge is primarily due to repetitive sequences, such as satellite DNA, DNA transposons, retrotransposons and variation in gene copy-number. These repetitive elements are prevalent in allopolyploid genomes, and their accurate detection is essential for correct genome annotation [
110,
111,
112].
It is important to note that using reference sequences discordantly annotated can distort the assessment of differential expression of homoeologs. This issue is particularly significant because annotations, often based on RNA-seq data, may incompletely or incorrectly annotate homoeolog exons, especially for genes expressed at low levels. Consequently, depending on the used tool, the mapping stage may be restricted to accurately annotated loci in both sub-genomes. In the absence of a reference genome,
de novo assembly of the transcriptome involves reconstructing the transcript set from multiple samples, potentially under varying conditions, to achieve a transcriptome as comprehensive as possible. However, assembling the transcriptome of an allopolyploid presents nearly two additional challenges compared to its diploid parental species. The first challenge, known as “homoeologs collapse,” occurs when two homoeologs are erroneously assembled into a single hybrid transcript. The second challenge, referred to as “SNP shuffling” arises when a homoeolog is assembled incorporating polymorphisms from the other homoeolog, complicating accurate sequence reconstruction [
113]. In cases where parental transcriptomes or genomes need to be assembled, accurately identifying the parents is crucial but not always straightforward. This complexity underscores the meticulousness required in transcriptome assembly of allopolyploids to avoid these pitfalls and ensure robust data interpretation.
Once reference sequences are identified, two distinct mapping approaches can be used. The first approach utilizes traditional algorithms originally developed for diploid organisms. Well-known tools such as STAR [
114] and HISAT [
115], are used to align RNA-seq reads directly to the reference genome. Additionally, pseudoalignment programs like Kallisto [
116] provide alternative methods for efficient sequence alignment based on transcript abundance estimation. The second approach involves specialized pipelines specifically designed for polyploid organisms. These methods typically fall into two categories:
Similarity-based mapping: These pipelines compare allopolyploid reads to loci in the genomes of the parental species to identify homoeolog sequences.
Genotype-aware mapping: These pipelines aim to detect genotypic differences and classify allopolyploids reads to the corresponding sub-genomes.
Both approaches play crucial roles in overcoming the challenges unique to allopolyploid transcriptome analysis, ensuring accurate quantification of gene expression and facilitating comprehensive understanding of gene regulation in allopolyploid organisms. As a result, these pipelines generate sub-genome specific BAM files that serve as inputs for downstream bioinformatics analyses. Several tools leverage similarity search for such analyses, including HomeoRoq [
117] and PolyDog. The HomeoRoq pipeline requires genomic sequences from both parental organisms and employs STAR to map RNA-seq reads to each reference sequence independently. Reads are then classified based on mismatch counts to determine their genomic origin. Similarly, PolyDog uses genomic sequences from both parental genomes as references. It employes GSNAP [
118] as its default mapping tool to align RNA-seq reads. The final assignment of reads to specific sub-genomes is determined by considering quality scores, alignment length, and mismatch quantity.
Several tools utilize genotypic differences to classify RNA-seq reads to their respective sub-genomes in allopolyploid organisms. Among these tools are PolyCat and EAGLE-RC [
119]. PolyCat uses a single reference sequence along with GSNAP as its default mapping tool. It utilizes SAMtools [
120] for SNP calling to identify “homoeo-SNPs” between parental diploid reads and the reference genome. PolyCat then generates a SNP index, which is used to categorize allopolyploid reads into their respective sub-genomes based on matches with identified homoeo-SNPs. Similarly, EAGLE-RC utilizes an alignment in BAM format and a set of variants in VCF format, along with a single reference genome. It classifies reads based on genotype differences, EAGLE-RC also offers a mode (--ngi mode) that directly classifies reads using the likelihood derived from alignments different sub-genomes, without explicit genotype information. When aiming to determine homoeologs expression level, benchmark studies have demonstrated that traditional software originally designed for diploid species results in higher error rates, particularly exceeding 10% when using pseudoalignment algorithms. However, using sub-genome classification pipelines can significantly reduce mapping errors. For instance, tools like EAGLE-RC achieve error rates below 1%, while HomeoRoq typically achieves less than 2%, depending on variables such as species, expression level, and read length. It is crucial to note that discrepancies in expression level assessment primarily stem from genes with low expression levels. These genes can cause significant shifts in the calculation of expression ratios between homoeologs [
15].
7. Normalization in Differential Expression Analysis
Transcriptome profiling is commonly applied to identify differentially expressed genes. In differential expression analysis, high-throughput RNA-seq data is used to determine if there are statistically significant differences in read abundance at each locus across different experimental conditions. This analysis can involve comparing homoeolog genes within the allopolyploid genome or assessing the global expression levels of each sub-genome relative to their respective parental genomes. These comparisons are crucial for addressing diverse biological questions, including identifying and understanding the mechanisms that enable polyploids exhibiting superior vegetative growth compared to their progenitor species.
Before identifying differentially expressed genes, it is crucial to address biases by normalizing read counts obtained during the quantification step. Errors in normalization can lead to significant consequences in downstream analysis, such as an increased rate of false positives in differential expression analysis [
121]. The primary goal of normalization is to mitigate biases so that differences in normalized read counts accurately reflect true differences in gene expression between samples. Thus, a gene is deemed differentially expressed across biological conditions if it shows varying mRNA levels per sample under those conditions [
122]. The biases that mainly impact the detection of differential expression include gene length, GC content (the proportion of guanine and cytosine nucleotides), and library size (total number of sequenced reads
per sample, i.e., sequencing depth). Gene length and GC content primarily introduce biases in comparisons of gene counts within a single sample (“within-sample” bias). These factors are often disregarded in differential expression analysis between two or more biological conditions. However, they can affect the accuracy of expression level quantification if not appropriately normalized or accounted for. On the other hand, library size plays a crucial role in comparisons of the same gene’s counts across different samples (“between-sample” bias). Differences in sequencing depth between samples can lead to misleading conclusions about differential expressions if not adjusted during normalization.
In RNA-seq libraries, read counts directly correspond to relative gene abundances, meaning highly expressed genes can dominate, thereby reducing the number of reads available for less expressed genes. Consequently, increasing library size enhances the detection sensitivity for poorly expressed genes. Normalizing samples with different library sizes is crucial because without normalization, a sample sequenced at a higher depth than another would show inflated read counts for all genes. This could potentially lead to erroneous conclusions regarding differential expressions between samples. The simplest method of between-sample normalization involves scaling raw read counts within each sample by a sample-specific factor that reflects its library size [
122]. Several data normalization methods have been developed to mitigate this bias in RNA-seq analysis. FPKM (Fragments Per Kilobase Million) [
123] and TPM (transcript per million) methods [
124] primarily focus on “within-sample” normalization, adjusting for gene length within individual samples [
125]. In contrast, TMM (edgeR’s Trimmed Mean of M values) [
126] and RLE (DESeq’s relative log expression) [
127], are applied for “between-sample” normalization where differences in sequencing depth between samples are addressed. No normalization method is without limitations, and each approach has scenarios where its assumptions may not hold. For instance, methods like TMM and RLE are effective in managing differences in mRNA levels
per cell across conditions when there are only a few differentially expressed genes, particularly in cases with minimal asymmetry among highly expressed genes. However, these methods may struggle to handle situations involving a global shift in gene expression, such as when there is a proportional increase in mRNAs
per cell due to changes in ploidy levels. This scenario can lead to substantial asymmetry in the number of highly expressed genes between conditions [
122]. Studies have highlighted instances where varying ploidy levels can cause significant global shifts in gene expression, thereby challenging the applicability of traditional normalization methods [
128,
129,
130].
Therefore, the choice of a normalization strategy depends on the specific biological question being addressed and the characteristics of the samples under study. For example, when comparing samples from the same cell type in diploid organisms, the number of mRNA molecules
per cell (transcriptome size) typically remains relatively stable. However, this consistency does not hold when comparing samples with different ploidy levels, as observed in [
129,
131,
132]. This is because polyploid organisms often exhibit larger cells, which are positively correlated with increased transcriptome size [
133,
134].
Traditional sample preparation methods that solely standardize by sample concentration, without accounting for biomass or cell count, may result in inaccurate interpretations [
135]. For instance, when assessing whether changes in gene expression correlate with variations in photosynthesis
per unit of surface area, it can be more insightful to compare expression levels among samples with equal biomass quantities. This approach provides a more robust basis for analyzing gene expression patterns than relying solely on sample concentration-standardized preparations [
132]. Therefore, particularly when comparing organisms with different ploidy levels, it is crucial to consider various scale of comparison:
per transcriptome (RNA concentration-based),
per cell, and
per biomass (surface area or weights) [
132,
135]. Using these different approaches, studies have demonstrated that the polyploid plant
Tolmiea menziesii (which has a lower cell density compared to the diploid
T. diplomenziesii) tends to maintain diploid-like transcript abundance
per biomass but increasing gene expression
per cell. Notably, this enhanced expression is enriched for functions related to photosynthesis [
132].
To address normalization issues related to biomass or cell count and mitigate transcription size bias, scientists have proposed the use of multiple different exogenous “RNA spike-in” [
136]. These RNA spike-ins, known for their precise concentrations, are added in proportion to the biomass or number of cells prior to RNA-seq library preparation. Subsequently, these spike-ins serve as scaling factors for between-sample normalization after quantification. Implementing RNA spike-ins requires prior knowledge of the biomass quantity or the number of cells included in each sample for accurate normalization. When direct cell counting is impractical, researchers can estimate cell numbers based on DNA content using techniques such as flow cytometry [
130,
136].
In the context of the scenarios mentioned earlier, analyzing differentially expressed genes in allopolyploids can present more complexity compared to standard RNA-seq analyses. A widely used approach was introduced by [
19], where they performed comprehensive comparisons including relative expression levels between the two parents, the hybrid allopolyploid versus its parents, and the allopolyploid versus the mid-parent value. Recently, [
137] adopted a similar strategy, utilizing RNA-sequencing data to explore the transcriptional responses to saline stress in cotton and heterosis events in root traits in rice. In their studies, they introduced HYBRIDEXPRESS, a tool designed to facilitate comparative analyses between allopolyploid hybrids and parental lines. This tool encompasses transcriptomic data processing, graphic visualization, data normalization, identification of differentially expressed genes, classification of genes based on the stages proposed by [
19], and functional annotation of these genes. Such tools, alongside advancements in sequencing technologies and methodologies for homoeology inference and sub-genome identification, hold significant promise in elucidating parental legacies, homoeolog expression biases in allopolyploids, understanding their evolutionary mechanisms, and providing crucial insights for crop breeding.