2.1. Genomics
The genome sequence of the model forest plant species, black cottonwood (
Populus trichocarpa), was published in 2006 [
27]. It was recently updated to version 4.1 (Phytozome 13, release date: 2022 Oct 5) with significantly improved accuracy and continuity compared to the initial version. The release of the black cottonwood genome spawned research in the functional genomics of forest trees, especially angiosperms. However, conifers, which belong to gymnosperms and dominate many temperate and boreal forests, have unusually large genome sizes ranging from 18 to 35 Gb with significantly different functional genomic characteristics from angiosperms [
28]. In 2013, the first conifer genome belonging to Norway spruce (
Picea abies) was published [
29]. It was 4.3 Gb, only about one-fifth of the actual Norway spruce genome size (19.6 Gb). However, the advent and development of next-generation sequencing (NGS) and third-generation sequencing (TGS) have greatly advanced the capacity to decode genomes for a wide range of forest plants. Emerging long-read and ultra-long-read sequencing technologies and advanced assembly algorithms have provided useful tools for assembling complex plant genomes with high repeat content or heterozygosity (
Figure 2) [
30,
31]. As a result, the difficulty of assembling a complete or near-complete genome of forest plants has been significantly alleviated in recent years [
32]. To date, several high-quality forest plant genomes have been published, including gymnosperm species with genome sizes larger than 10 Gb, such as
Ginkgo biloba [
33],
Taxus chinensis [
34],
T. wallichiana [
35],
Larix kaempferi [
36], and
Pinus tabuliformis [
37].
A well-annotated reference-level genome provides a clear road map for downstream gene function and diversity studies, allowing an in-depth interpretation of genotype-phenotype relationships and functional DNA elements involved in various biological processes [
38]. For example, the release of the black cottonwood genome has enabled studies on the genome-wide identification of regulatory genes and non-coding RNAs (ncRNAs) involved in several important biological processes, such as wood formation [
39,
40], annual growth cycle [
41,
42], flowering process [
43,
44], and responses to abiotic stresses [
45,
46,
47] in
Populus species. Moreover, a comprehensive phylogenetic analysis of functional elements based on the genomes of several model plant species, such as poplar, rice, and Arabidopsis, has been performed. Comparative genomic analysis among the different model species provides insights into the duplication history, selection pressure, and structural divergence of functional genes from an evolutionary aspect [
48,
49]. Likewise, the whole-genome scale discovery and cross-species comparison of the insertion and distribution of transposable elements, especially the long terminal repeat retrotransposons (LTR-RTs), could help further investigate their effects on adjacent gene expression and plant phenotype [
50,
51]. Overall, the accumulation of high-quality genome sequences will provide a comprehensive understanding of the contribution of functional DNA elements to phenotypic variation among forest plants.
Many forest plant species have a wide geographical distribution, large population size, and high genetic diversity at local and regional scales [
52]. Thus, a single reference genome cannot fully represent the DNA sequence diversity within a species. Population genomics studies using whole genome resequencing (WGRS) or reduced-representation genome sequencing (RRGS) can provide insights into the genetic diversity of forest plants at the single-nucleotide polymorphism (SNP) level. With WGRS, SNPs are detected at the population level by mapping reads to a reference genome and then calling variants. The detected SNPs allow further genome-wide polymorphism analysis during adaptive population divergence. For example, the WGRS of 427 moso bamboos (
Phyllostachys edulis) from multiple representative geographic regions and subsequent population genomic analysis revealed several candidate genes under balancing selection or related to several agriculturally important traits, such as clear culm height, node number, density, and compressive strength [
53]. In addition, several candidate genes related to light response, growth-promoting cytokinin, and wood development were identified by genome sequencing and WGRS of 80 silver birch (
Betula pendula) with clear evidence of recent natural selection [
54]. However, despite the continuously decreasing sequencing cost, the WGRS of many plant samples/species is still expensive. As an alternative or complementary approach to WGRS, RRGS, which comprises reduced-representation libraries and restriction-site-associated DNA sequencing, has been developed by integrating restriction enzymes into high-throughput sequencing to obtain a reduced genome representation [
55]. Compared to WGRS, RRGS has apparent advantages of high efficiency and low cost and does not require a reference genome [
55]. RRGS enables genome-wide SNP discovery for non-model species lacking genome sequence information or species with large and complex genomes [
56,
57]. Furthermore, the genetic linkage map can be constructed using RRGS data, and QTL mapping and genome-wide association analysis (GWAS) can be applied to identify the phenotype-genotype relationships across genomes of forest plants [
58,
59,
60]. However, RRGS can also result in missing information possibly related to population differentiation, limiting its application scope.
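The idea behind a reduced genome representation can be illustrated with a minimal in silico digest. The sketch below is not a published RRGS pipeline: the recognition site (`GAATTC`), the size window, and the toy sequence are illustrative assumptions, and cutting is simplified to the start of each site. It only shows how restriction digestion followed by size selection keeps a fraction of the genome for sequencing.

```python
import re

def digest(sequence, site="GAATTC"):
    """Return fragment lengths after cutting at every occurrence of `site`.

    Simplification (assumption): the cut is placed at the start of the site,
    not at the enzyme's true cleavage position.
    """
    cuts = [m.start() for m in re.finditer(site, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

def reduced_fraction(sequence, site="GAATTC", min_len=4, max_len=12):
    """Fraction of the sequence retained after size-selecting fragments."""
    kept = [f for f in digest(sequence, site) if min_len <= f <= max_len]
    return sum(kept) / len(sequence)

# Toy 40-bp "genome" with two recognition sites (hypothetical data).
toy_genome = "AAAA" + "GAATTC" + "CCCC" + "GAATTC" + "T" * 20
fragments = digest(toy_genome)
fraction = reduced_fraction(toy_genome)
print(fragments, fraction)
```

In this toy case, only the fragments inside the size window would be sequenced, which is why RRGS lowers cost but can also miss regions that happen to fall outside the selected fragments.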
WGRS and RRGS easily detect SNPs and short insertions and deletions. However, structural variations (SVs), including the presence/absence variants (PAVs), copy number variants, and chromosomal rearrangements, are rarely detected by short-read sequencing [
61]. SVs genetically control the phenotypic variability within and between plant species [
62,
63]. Recent advances in long-read sequencing have enabled the generation of high-quality assemblies for several individuals per species across many plant species, providing a solid foundation for accurately identifying SVs by pan-genomic analysis at the species level [
61,
64]. Therefore, a pan-genome represents a more comprehensive DNA sequence diversity of a plant species or taxonomic group. As a result, pan-genomic studies are carried out to comprehensively understand the genetic diversity of several model plant species or economically important crops, including
Arabidopsis, barley, rice, tomato, and soybean [
65]. However, pan-genome research has been conducted on very few forest plant species, including poplar [
66] and pecan [
67]. Given the universality of sequencing technologies, assembly algorithms, and pan-genomic analysis pipelines, pan-genome will soon become a routine analysis tool for mining genetic variation and functional DNA elements of forest plant species.
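A presence/absence variant (PAV) table is the basic data structure behind most pan-genome summaries. The following sketch assumes a hypothetical PAV table (gene and accession names are placeholders) and classifies genes as core, dispensable, or private. Real pipelines derive such tables from assemblies or read mapping and usually apply softer thresholds.

```python
def classify_pan_genome(pav):
    """Classify genes from a PAV table.

    pav: dict mapping gene -> dict mapping accession -> bool (True = present).
    Returns a dict with 'core' (present in all accessions), 'private'
    (present in exactly one), and 'dispensable' (everything in between).
    """
    result = {"core": [], "dispensable": [], "private": []}
    for gene in sorted(pav):
        calls = pav[gene]
        n_present = sum(1 for present in calls.values() if present)
        if n_present == len(calls):
            result["core"].append(gene)
        elif n_present == 1:
            result["private"].append(gene)
        elif n_present > 1:
            result["dispensable"].append(gene)
        # Genes absent from every accession are ignored.
    return result

# Hypothetical PAV calls for four genes across three accessions.
toy_pav = {
    "geneA": {"acc1": True, "acc2": True, "acc3": True},
    "geneB": {"acc1": True, "acc2": False, "acc3": True},
    "geneC": {"acc1": False, "acc2": False, "acc3": True},
    "geneD": {"acc1": True, "acc2": True, "acc3": True},
}
pan = classify_pan_genome(toy_pav)
print(pan)
```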
In addition to genome sequences, population genomics, and pan-genomics, the three-dimensional genome structure, chloroplast genome, and mitochondrial genome are also useful for gene discovery and genetic engineering of plants [
68,
69,
70]. As genomic resources grow in number and diversity, integrating the existing genomic data will facilitate efficient and accurate identification of functional elements in forest plants, which benefits subsequent breeding. Genomic data from different sources have been submitted to publicly available databases, such as the National Center for Biotechnology Information, Ensembl, CoGe, GigaDB, and BIG Data Center. The development of a web-based, comprehensive genomic database, similar to BRAD [
71] and Gramene [
72], could also largely accelerate the genomic data integration for the genetic breeding of forest plant species.
2.2. Transcriptomics
Transcriptomics is one of the most commonly used omics approaches in plant biology research. It involves studying the transcriptome, the complete set of transcripts generated by a cell or tissue [
73]. Understanding the transcriptome is crucial in elucidating the structural and functional organization of the genome [
74]. Several hybridization- and sequence-based approaches [
74], such as microarray, expressed sequence tag (EST) sequencing, serial analysis of gene expression, and RNA sequencing (RNA-seq), have been developed for transcriptome profiling. Among these approaches, RNA-seq, which captures all transcripts by high-throughput sequencing, is a revolutionary tool for accurate high-resolution transcriptome analysis [
75]. For example, NGS-based RNA-seq generates millions of short reads, 25 to 300 bp in length.
Many computational tools have been developed to interpret the RNA-seq short reads data. For example, a transcriptome assembly is reconstructed by aligning RNA-seq reads to a known genome assembly using assemblers such as Cufflinks [
76], StringTie [
77], and Scripture [
78]. Even without a reference genome, RNA-seq reads are
de novo assembled into transcripts using assemblers such as Trinity [
79], SOAPdenovo-Trans [
80], and Oases-Velvet [
81]. These reference-based or
de novo strategies have been successfully applied in constructing a reference transcriptome with RNA-seq reads for many plant species. However, although NGS short reads have high accuracy, they rarely span multiple exons. Besides, assembling NGS short reads into full-length transcripts is complicated by alternative splicing events frequently occurring in the genome [
82]. Luckily, this challenge is alleviated by TGS-based transcriptome sequencing approaches, such as full-length isoform sequencing (Iso-Seq) and nanopore-based direct RNA sequencing, which allows the direct sequencing of full-length transcripts without assembly [
83,
84]. However, given the relatively high error rate of TGS reads, highly accurate NGS RNA-seq reads are required to improve TGS-based transcriptome assembly accuracy [
85]. At present, high-quality full-length transcriptomes have been obtained for several forest plant species, including
Larix kaempferi [
86],
Chosenia arbutifolia [
87],
Fritillaria cirrhosa [
88], and
Alsophila spinulosa [
89].
The reference-level transcriptome assembly is crucial for the downstream analysis of gene expressions under multiple conditions (
Figure 3). Gene expression levels are usually estimated from RNA-seq reads. The reads are mapped to a reference transcriptome or genome, the number of reads matching each gene is counted, and the read counts are normalized using measures such as fragments per kilobase of transcript per million mapped reads (FPKM), transcripts per million (TPM), and counts per million (CPM) [
90]. Subsequently, the differential expression (DE) analysis is performed using either non-parametric or parametric tools, such as DESeq2 [
91], edgeR [
92], and SAMseq [
93]. DE analysis is widely used to analyze the forest plant responses to biotic and abiotic stresses, including drought, heat, salinity, flooding, cold, ultraviolet radiation, diseases, and insects [
94,
95]. For instance, the gene expression analysis in poplar root under polyethylene glycol-induced drought stress by TGS and NGS transcriptomic sequencing revealed several differentially expressed genes related to plant responses to drought in the biosynthesis and metabolism pathways [
96]. In addition, the dual RNA-seq analysis of the interactions between Norway spruce and
Heterobasidion revealed that several genes involved in the abscisic acid signaling were differentially expressed in Norway spruce, which might contribute to the Norway spruce response to the pathogen attack [
97]. DE analysis also outlines the transcriptome dynamics across different tissues and plant developmental stages [
98]. Notably, DE analysis under multiple conditions may identify too many or too few differentially expressed genes, requiring the analysts to integrate biological data from other sources or change the software parameters.
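The read-count normalization measures mentioned above can be written out explicitly. The sketch below uses the standard definitions of CPM, FPKM, and TPM on a toy count vector; the counts and gene lengths are made-up values, and effective-length corrections used by dedicated quantification tools are omitted.

```python
def cpm(counts):
    """Counts per million: scale each count by library size."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

# Toy library with three genes (hypothetical counts and lengths in bp).
toy_counts = [100, 300, 600]
toy_lengths = [1000, 2000, 3000]

cpm_values = cpm(toy_counts)
fpkm_values = fpkm(toy_counts, toy_lengths)
tpm_values = tpm(toy_counts, toy_lengths)
print(cpm_values, fpkm_values, tpm_values)
```

Unlike FPKM, TPM values always sum to one million within a sample, which makes proportions easier to compare between samples. Differential expression tools such as DESeq2 and edgeR start from raw counts and apply their own normalization rather than these measures.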
The gene expression data can also be used to construct the gene co-expression network (GCN), a powerful tool for further elucidation of the gene regulatory relationships and identification of candidate functional genes [
99]. The weighted gene co-expression network analysis (WGCNA) is a widely used pipeline that constructs a GCN by clustering genes into modules based on their expression patterns and identifying hub genes within each module [
100]. The gene co-expression analysis has been successfully applied in detecting key functional genes regulating various biological processes in forest plant species, such as
Pinus tabuliformis [
101],
Populus trichocarpa [
102],
Hevea brasiliensis [
103], and
Zanthoxylum armatum [
104]. In addition, cross-species GCN comparison can help infer the origin of new phenotypes and conserved gene functions at the species level [
105]. Although this approach is rarely applied in forest plants, it has great potential for mining hub genes underlying important traits of forest plants. Moreover, transcriptome sequences and RNA-seq data can also be applied in the genome-wide detection of SNP and simple sequence repeat markers (
Figure 3), which are valuable molecular tools in forest plant breeding [
106].
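The core idea of a weighted co-expression network can be sketched in a few lines. The example below is a simplified illustration and not the WGCNA package itself (which is an R package): it builds an unsigned adjacency as the absolute Pearson correlation raised to a soft-threshold power and ranks genes by connectivity. The gene names, expression values, and the power `beta = 6` are assumptions chosen for illustration, and module detection is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def connectivity(expression, beta=6):
    """Sum of soft-thresholded adjacencies |r|**beta to all other genes."""
    genes = list(expression)
    k = {}
    for g in genes:
        k[g] = sum(
            abs(pearson(expression[g], expression[h])) ** beta
            for h in genes
            if h != g
        )
    return k

# Hypothetical expression profiles of four genes across five samples.
toy_expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [1, 2, 3, 5, 4],
    "g3": [1, 3, 2, 4, 5],
    "g4": [3, 1, 4, 2, 5],
}
k = connectivity(toy_expression)
hub = max(k, key=k.get)
print(hub, k)
```

The gene with the highest connectivity is treated here as the hub candidate. In practice, hub genes are usually defined within modules and the soft-threshold power is chosen from the data.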
With the advances in RNA-seq and tissue processing approaches, single-cell RNA-seq (scRNA-seq) has become a revolutionary tool for studying plant functional genomics at the cellular level [
107]. In principle, scRNA-seq can be applied to any tissue of any plant. This emerging approach obtains transcripts of thousands of cells per sample, providing new insights into gene expression heterogeneity across cells. Using the scRNA-seq data, cells can be grouped into different categories by dimensionality reduction and clustering, which allows for the reconstruction of the cell differentiation trajectories [
107]. Distinct expression patterns of different cell clusters provide further insights into the gene regulatory networks and candidate genes related to plant organ development. For instance, the scRNA-seq of 6,796 poplar stem cells predicted the cell differentiation trajectories involved in phloem and xylem development and candidate genes related to vascular development in poplars [
108]. Another newly developed technique, spatial transcriptomics, quantifies and localizes gene expression within the tissue by combining histological imaging and RNA sequencing [
109]. Its derivative, spatial single-cell transcriptomics, also recently developed, integrates the spatial information from spatial transcriptomics with the cellular gene expression obtained by scRNA-seq, helping to elucidate the complex spatial gene regulatory networks related to plant organ development [
110]. Despite its high costs, spatial single-cell transcriptomics has great potential in precisely breeding forest plants.
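The clustering step of an scRNA-seq analysis can be illustrated with a small, self-contained sketch. It log-normalizes per-cell counts and groups cells with k-means (Lloyd's algorithm) using a deterministic farthest-point initialization. The cells, the three marker genes, and the scaling factor of 10,000 are hypothetical, and the dimensionality-reduction step used on real data is skipped because the toy matrix has only three genes.

```python
import math

def log_normalize(counts, scale=1e4):
    """Scale a cell's counts to a common depth, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c * scale / total) for c in counts]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Lloyd's k-means with farthest-point initialization (deterministic)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: dist2(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical counts per cell for (xylem marker, phloem marker, housekeeping).
toy_cells = [
    [90, 5, 50],
    [80, 10, 60],
    [4, 95, 55],
    [8, 85, 45],
    [100, 2, 40],
]
normalized = [log_normalize(cell) for cell in toy_cells]
labels = kmeans(normalized, k=2)
print(labels)
```

Cells dominated by the same marker end up in the same cluster. Dedicated single-cell toolkits typically use graph-based clustering on principal components instead of plain k-means.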
2.3. Epigenomics
Epigenomics studies the epigenome, which consists of the biochemical modifications in the nuclear DNA, histone proteins, and ncRNAs [
111]. Although these epigenomic modifications do not alter the nucleotide sequences, they are inherited across generations through mitosis or imprinting. Epigenetic changes, such as DNA methylation, histone modifications and variants, and ncRNA regulation, are frequently induced by environmental stresses or endogenous signals during plant development, which alters the chromatin structure and gene expression [
112]. As a result, epigenetics is a powerful driving force in the environmental adaptation of plants by altering their phenotypic plasticity [
111]. Therefore, epigenomic studies can provide important insights into the epigenetic basis underlying complex phenotypes and local adaptation of forest plants, which cannot be deduced using DNA sequence variants.
The availability of high-quality reference genomes facilitates the genome-wide detection of epigenetic variants at the single-nucleotide level in forest plant species [
113]. DNA methylation has been extensively studied in plants by detecting base modifications using bisulfite or long-read sequencing [
114]. In plants, DNA methylation occurs in the CG, CHG, and CHH sequence contexts (where H is A, C, or T) in gene bodies and transposable elements. More importantly, the methylation levels vary substantially across plant species, tissues, and cells [
115]. In addition, DNA methylation is a highly dynamic process during plant growth and development, including the establishment, maintenance, and active removal of methylation sites [
115]. The dynamics of DNA methylation play important roles in the epigenetic regulation of plant growth, development, and response to environmental stresses [
116]. For instance, several studies have revealed the potential role of DNA methylation in flower development [
117], drought tolerance [
118], wood formation [
119], and immune response to pathogen infection [
120] in
Populus species. Since DNA methylation alters gene expression, transcriptome analysis is usually combined with methylation analysis to investigate the functional relationship between the epigenome and transcriptome [
114].
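The three plant methylation contexts and the bisulfite-based methylation level can be made concrete with a short sketch. It assumes forward-strand cytosines only and a hypothetical table of read counts per position. After bisulfite conversion, unmethylated cytosines are read as T while methylated cytosines remain C, so the level is estimated as C reads divided by C plus T reads.

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG', or 'CHH' for the cytosine at index i (H = A, C, or T).

    Returns None if seq[i] is not C or the context runs past the sequence end.
    """
    if seq[i] != "C" or i + 1 >= len(seq):
        return None
    if seq[i + 1] == "G":
        return "CG"
    if i + 2 >= len(seq):
        return None
    return "CHG" if seq[i + 2] == "G" else "CHH"

def methylation_by_context(seq, read_counts):
    """Pool bisulfite read counts by context.

    read_counts: dict mapping position -> (reads showing C, reads showing T).
    Returns dict mapping context -> methylation level in [0, 1].
    """
    pooled = {}
    for pos, (n_c, n_t) in read_counts.items():
        ctx = cytosine_context(seq, pos)
        if ctx is None:
            continue
        c_sum, t_sum = pooled.get(ctx, (0, 0))
        pooled[ctx] = (c_sum + n_c, t_sum + n_t)
    return {ctx: c / (c + t) for ctx, (c, t) in pooled.items() if c + t > 0}

# Toy reference and hypothetical bisulfite read counts at three cytosines.
toy_seq = "ACGTCAGCTTC"
toy_reads = {1: (9, 1), 4: (3, 7), 7: (1, 19)}
levels = methylation_by_context(toy_seq, toy_reads)
print(levels)
```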
Moreover, the recently developed single-cell methylation profiling approaches have allowed the tracing of DNA methylation dynamics at the single-cell level [
121]. In addition to the DNA methylation marks, histone marks, including histone modifications and variants, also transcriptionally silence or activate genes [
122,
123]. Histone modifications include histone methylation, acetylation, phosphorylation, ubiquitylation, and sumoylation, all of which are reversible amino acid modifications at the N-terminal tails of histone proteins within the nucleosome core [
112]. Among these modifications, histone methylation and acetylation are the most studied modifications regulating plant development and environmental stress response [
124,
125]. Histone variants are sequence variants of core histones H2A, H2B, H3, and H4, which regulate nucleosome structure and function [
126]. Histone marks are detected by profiling DNA-protein interactions across the genome using chromatin immunoprecipitation sequencing (ChIP-seq), often complemented by chromatin accessibility profiling with the assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) [
127].
Furthermore, regulatory ncRNAs, including the long non-coding and small RNAs, are important epigenetic marks with diverse functions in response to abiotic stress in forest plants [
128]. For instance, several micro RNAs are differentially expressed during Norway spruce embryo development, potentially contributing to epigenetic memory and climatic adaptation [
129]. Overall, the functional analysis of different epigenetic marks has extended the scope of plant biology.
Understanding the epigenetic mechanisms and variants is beneficial for the epigenetic improvement of forest plants. Naturally or artificially-induced epigenetic variants serve as a novel genetic resource for plant epi-breeding [
112]. Since the naturally occurring epigenetic variations are greatly limited, several laboratory-based approaches, including chemical treatment, biotic and abiotic stress treatment, tissue culture, grafting, RNA interference, and direct epigenome editing by CRISPR/Cas9 have been applied to manually modify the plant epigenome [
130]. These artificial methods induce a wider range of phenotypic variation while increasing the transgenerationally inherited epialleles; hence, they are powerful tools in epi-breeding programs (
Figure 4). Natural and artificially-induced epialleles can be employed as epigenetic markers in quantitative epigenetics based on the epigenetic quantitative trait loci and epigenome-wide association, which are important steps at the early stage of epi-breeding [
131]. Moreover, epigenome editing methods using advanced CRISPR/Cas9 or CRISPRoff technologies can directly increase the stress resilience of plants through epigenome engineering [
132,
133]. In recent years, quantitative epigenetics and epigenome engineering have been successfully applied in the epigenetic breeding of crop plants, including rice, tomato, potato, and soybean [
134,
135,
136,
137]. Compared to crops, the epigenetic mechanisms and variants in forest plants are relatively understudied. With increased knowledge of the epigenetics of forest plants, epigenetic breeding will play an important role in improving more complex traits and promoting the adaptation of forest plants to the changing climate.
2.4. Proteomics
Proteins are large biological molecules that carry out most cellular activities and are important components of plant cells and tissues. Proteomics studies the proteome, including the composition, localizations, modifications, and interactions of all the proteins expressed from a genome [
138]. The plant proteome significantly varies across the cells and under different developmental and environmental conditions [
139]. Most eukaryotic proteins undergo post-translational modifications, altering protein expressions and functions [
140]. Thus, proteomics is a powerful omics tool for comprehensively understanding the biological processes in the post-genomic era [
141]. Genome sequencing provides DNA sequences of protein-coding genes, laying a solid foundation for proteomics research. The proteome contains much more complex functional gene information than that provided by the genome [
142]. As a result, the accurate identification and quantification of the complete proteome are highly challenging. So far, several analytical technologies have been developed for proteomics-based analysis (
Figure 5), including protein microarrays, gel-based approaches, quantitative approaches (isobaric tags for relative and absolute quantitation (iTRAQ), isotope-coded affinity tags, and stable isotope labeling with amino acids in cell culture), and high-throughput approaches (mass spectrometry and nuclear magnetic resonance spectroscopy) [
138]. Among these approaches, mass spectrometry, in the form of liquid chromatography-tandem mass spectrometry or matrix-assisted laser desorption ionization time-of-flight mass spectrometry, is widely used to monitor plant proteome dynamics. A comprehensive plant proteome profiling provides valuable insights into the molecular mechanisms underlying plant growth, development, and stress response [
139]. For example, the tandem mass tag-based proteome sequencing of Masson pine (
Pinus massoniana) with different resin yields revealed several differentially expressed proteins related to resinosis [
143]. In addition, the
Picea asperata somatic embryo proteome profiling using iTRAQ and comparative proteomics analysis under partial desiccation treatment provided novel insights into stress-related proteins and metabolic pathways in
P. asperata [
144]. Integrating proteomics, including single-cell and spatial proteomics, with computational approaches and other omics will enhance the proteomics potential in deepening our understanding of the functions and interactions of proteins [
145,
146].
Proteomics also identifies the candidate proteins underlying complex traits by linking protein expression to genetic maps through QTL analysis at the protein level [
147]. The identified proteins serve as powerful biomarkers for the precision breeding of quantitative traits in plants [
148]. In crops, proteomics and associated QTL mapping have successfully identified functional proteins and genes related to production or stress tolerance [
149]. For instance, the large-scale proteome sequencing of 102 barley genotypes revealed drought-sensitive proteins in the different genotypes [
150]. Further genetic linkage analysis of these proteins identified several proteomic QTLs (pQTLs) with potential breeding value for drought-tolerant barley. The label-free proteome sequencing of 148 recombinant inbred lines of pepper (
Capsicum annuum) also revealed several candidate hotspot regions encoding functional proteins related to fruit development by pQTL analysis [
151].
Despite its great potential in plant biology research and genetic breeding, proteomics is limited by several challenges compared to genomic and transcriptomic approaches [
152]. First, the identification and quantification of the whole proteome are still challenging due to the limitations of the different proteome profiling methods. Second, the precision and reproducibility of proteome profiling and existing proteomics pipelines are unsatisfactory. Furthermore, deciphering the complex proteomic networks is still challenging since the proteome is far more complex than the genome. To date, proteomics has mainly served as an ancillary strategy for functional studies in the systems biology of plants, and large-scale applications of plant proteomics are still a long way off [
153].
2.5. Metabolomics
Metabolomics is an emerging post-genomics tool for comprehensive qualitative and quantitative studies of small-molecule metabolites with molecular masses below 1000 Da in cells or tissues [
154]. Plants produce various metabolites, including primary and secondary metabolites. Primary metabolites are essential for plant growth and development, while secondary metabolites play a major role in the plant responses to environmental factors [
155]. Metabolites are the end products of gene transcription and protein expression within an organism (
Figure 6) and act as links between genotypes and phenotypes [
156]. Among the omics tools, metabolomics has the closest relationship with the phenotype [
157]. Since its introduction, metabolomics has become an increasingly popular systems biology tool in plant science [
158].
The plant metabolome is highly dynamic and complex, with many small-molecule metabolites of diverse structures and content, making accurate metabolome profiling challenging [
159]. To date, a few analytical techniques have been developed for high-throughput quantitative metabolomics, including nuclear magnetic resonance, liquid chromatography-mass spectrometry, capillary electrophoresis-mass spectrometry, and gas chromatography-mass spectrometry [
160]. Metabolomics is classified into targeted and untargeted based on the study subject [
161]. Targeted metabolomics performs target analysis of known metabolites with high sensitivity and accuracy, revealing the fluctuations in specific metabolic pathways. In contrast, untargeted metabolomics performs non-biased detection of all metabolites while identifying the differential metabolites with significant changes for further screening analysis. Like other omics, single-cell sequencing can be integrated into metabolomics to unravel the cellular metabolism dynamics under environmental changes [
162]. Given the metabolic complexity in plants and the limitations in each analytical platform/method, combined approaches are increasingly employed in plant metabolome profiling studies.
The metabolite variation modulates diverse biological processes and may alter plant phenotypes [
163]. As a result, transcriptomics and metabolomics-based correlation analyses have enabled the genome-wide discovery of key genes controlling known and new metabolic pathways [
164]. This strategy has been regularly applied in forest plants, including
Zanthoxylum armatum [
104],
Phyllostachys edulis [
165],
Populus tomentosa [
166], and
Hevea brasiliensis [
167]. Given that most metabolic traits are heritable across generations [
168], combining metabolomics with QTL and GWAS can establish direct links between metabolites and phenotypes based on large-scale population metabolomic and phenotypic data [
158]. To date, most metabolome QTL (mQTL) and metabolome-based GWAS (mGWAS) have been applied in hunting genes related to metabolic traits in crop plants [
158], with a few success cases in poplars and apple trees. For example, in apple, mQTL analysis of the untargeted metabolic profiling data and genetic linkage maps revealed mQTL hotspots with many peel- and flesh-related metabolites [
169]. In addition, the mGWAS analysis for flavonoid features in
Populus tomentosa using targeted metabolomics data revealed more than 1,500 significant associations that accounted for phenotypic variation [
170].
Moreover, metabolic markers can be determined using metabolic profiling data under various stress conditions. The selected metabolites can help plant breeders accurately identify stress levels [
158]. Metabolomics has proven to be an effective tool in plant genetic breeding programs. However, a single approach to identifying and quantifying all metabolites within a plant species is still lacking [
160]. Therefore, future advances in metabolomics approaches should uncover more genes and metabolic pathways that are beneficial for plant breeding.
2.6. Multi-omics integration
Genomics is the most used omics discipline. The whole-genome DNA sequence informs the basic properties of a plant species but cannot solely determine the final phenotype. At the same time, not all DNA sequence variants lead to phenotypic variation. Instead, phenotypic plasticity is shaped by many molecular mechanisms, including epigenetic modification, gene expression and silencing, post-translational protein modification, and metabolite accumulation. Therefore, a single omics cannot sufficiently and comprehensively unravel the complex biological regulatory networks controlling the various phenotypic traits [
171].
The continuous and rapid progress in developing various high-throughput omics technologies has facilitated the integration of different omics data for plant systems biology studies [
172]. Transcriptomics, proteomics, and metabolomics are the most frequently used omics technologies in multi-omics integration (MOI) studies of plants, as they are the core of systems biology [
172]. MOI studies are accelerated by genomic information provided by well-annotated genomes and associated genomic analyses, epigenomics, and other omics approaches. However, multiple omics platforms produce large volumes of high-throughput data, greatly challenging the subsequent MOI analysis [
173]. The MOI analysis mainly involves the establishment of associations between different omics data sets [
174]. Therefore, analysts require a good understanding of the formats and characteristics of various omics data sources and a good background in software operations, statistical modeling, and data interpretation.
The simplest and most intuitive analysis strategy in MOI studies is the correlation analysis of two or more omics data sets using measures such as the Pearson, Spearman, and Kendall rank correlation coefficients [
175]. These correlation analyses can be applied to differentially expressed or specific biochemical pathway-related transcripts, proteins, and metabolites. However, various biological factors, along with experimental errors, may cause weak correlations between different omics data sets [
172,
176]. For example, transcriptome and proteome sequencing of
Quercus ilex under severe drought conditions recorded a poor correlation (
r = 0.11) between mRNA and protein [
177]. Such inconsistencies are alleviated by further sequencing and analysis. Another strategy for analyzing MOI-related data is clustering analysis based on the similarity of various omics data using hierarchical or partition clustering methods [
178]. Several statistical approaches, including similarity matrices, canonical correlation and co-inertia analysis, and matrix factorization, have been successfully applied in grouping forest plant multi-omics data [
178]. For instance, Pascual et al. (2017) used a
k-means clustering approach to integrate proteomic, metabolomic, and physiological data of
Pinus radiata based on their quantitative trends during different periods of ultraviolet treatment, obtaining 30 clusters [
179]. The clustering results can be further correlated with specific scientific questions. Moreover, multivariate-based analysis has enabled the integration of multi-omics data using multivariate data analysis approaches, such as principal component analysis, partial least squares, and orthogonal projections to latent structures (OPLS) [
180]. For example, the OPLS analysis of the transcriptome, proteome, and metabolome data from transgenic poplars identified several proteins related to wood formation [
181].
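The correlation strategy described in this subsection can be sketched for a single gene whose mRNA and protein abundances were measured in the same samples. The example below computes the Pearson and Spearman coefficients with the standard library only; the abundance values are invented for illustration and do not come from any of the cited studies.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical abundances of one gene product across five samples.
mrna = [1, 2, 3, 4, 5]
protein = [1, 4, 9, 16, 100]

r_pearson = pearson(mrna, protein)
r_spearman = spearman(mrna, protein)
print(r_pearson, r_spearman)
```

The toy relationship is monotonic but not linear, so the rank-based coefficient is higher than the Pearson coefficient. This is one reason the choice of correlation measure matters when omics layers are compared.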
A common weakness of correlation, clustering, and multivariate-based analyses is that they are based on statistical methods rather than prior knowledge of molecular mechanisms. To integrate known pathway information into MOI analysis, bioinformatics analysts have developed several pathway-based approaches, such as pathway mapping and co-expression analyses [
172]. Pathway mapping analysis maps various omics data sets against publicly available pathway databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc [
182]. For example, López-Hidalgo et al. (2018) reconstructed and visualized 123 of the 127 known KEGG pathways at the transcriptome, proteome, and metabolome level for
Quercus ilex using MapMan software [
183]. At the same time, multiple omics data sets could be employed to annotate certain plant metabolic pathways using known pathway information as the reference. For example, Wang et al. (2021) reconstructed the biosynthesis of two alkaloids, sanshools and wgx-50, using transcriptome and metabolome data [
104].
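Pathway mapping amounts to intersecting the features detected in each omics layer with pathway membership lists. The sketch below uses a hypothetical membership table with placeholder identifiers (not real KEGG or MetaCyc entries) and reports, for each pathway, which members were detected per layer and what fraction of the pathway is covered overall.

```python
# Hypothetical pathway membership; identifiers are placeholders only.
PATHWAYS = {
    "pathway_A": {"gene1", "gene2", "cpd1", "cpd2"},
    "pathway_B": {"gene3", "cpd3"},
}

def map_to_pathways(detected, pathways=PATHWAYS):
    """Map detected features from each omics layer onto pathways.

    detected: dict mapping layer name -> set of detected feature IDs.
    Returns dict mapping pathway -> {'hits': {layer: [IDs]}, 'coverage': float}.
    """
    report = {}
    for name, members in pathways.items():
        hits = {
            layer: sorted(members & features)
            for layer, features in detected.items()
        }
        covered = set().union(*hits.values())
        report[name] = {"hits": hits, "coverage": len(covered) / len(members)}
    return report

# Hypothetical detections from two omics layers.
toy_detected = {
    "transcriptome": {"gene1", "gene3", "gene9"},
    "metabolome": {"cpd1", "cpd2"},
}
report = map_to_pathways(toy_detected)
print(report)
```

Features that belong to no pathway in the table (such as `gene9` here) are simply left unmapped, which mirrors the incomplete annotation typical of non-model forest species.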
Alternatively, the co-expression analysis can be integrated with existing pathway databases. The WGCNA approach, the most popular gene co-expression analysis tool, can detect regulatory networks for each omics layer, construct a consensus correlation network [
184,
185], and identify hub elements, such as hub genes, proteins, and metabolites. Besides, the multi-omics WGCNA approach can perform efficient gene and module clustering and provide key regulatory network information through MOI.
However, neither statistical-based nor pathway-based MOI methods are independent. Instead, these methods are often combined to answer biological questions precisely. Several complementary MOI approaches exist, such as top-down differential analysis and bottom-up genome-scale modeling [
172]. These MOI approaches have been widely used in the in-depth analysis of complex metabolic pathways and other biological processes in forest plants. However, the heterogeneity in signal-to-noise ratio across multiple omics layers, the adverse effects of missing values, the limited interpretability of multi-omics models, and data sharing difficulties, such as metadata annotation, data storage, and computing resources, remain universal challenges across MOI studies [
186]. In any case, MOI analysis is a powerful tool for genome-wide functional element identification and forest plant breeding. Besides, the development of single-cell sequencing technology provides an exciting opportunity for single-cell MOI analysis.