As we have indicated, crop genetic improvement programs around the world are currently promoting the integration of omics science, bioinformatics, biotechnology, and agronomic disciplines in the development of improved crops. One of the main difficulties in this approach is relating the results obtained from each type of omics data and integrating this information. To carry out this process properly, several levels of data processing must be performed, from basic quality control of sequencing products to the exploration of metabolic and regulatory systems. In this section, we discuss the many applications of omics databases from Arabidopsis and other model plants for the study of crops or plants of economic importance.
2.1. De Novo Assembly of NGS Products and Other Types of Omics Information
The use of NGS techniques to speed up the breeding process is commonly suggested in current plant breeding programs. These strategies are ultimately included in the objectives of plant breeding projects as useful tools for pursuing the goals of these projects [
9]. Most plant species of commercial interest have already had their genomes partially or completely sequenced. In the best situation, it might be possible to have a reference transcriptome, genome, and even metabolome of the plant being studied [
10]. In the worst case, the only available information is individual sequences of the studied species. This makes it more difficult to process the NGS data and, as a result, to assemble genomes, transcriptomes, and other omics data correctly. NGS reads can be assembled using only mathematical parameters with a variety of bioinformatics tools. The single-processor version of ABySS is useful for assembling genomes up to 100 Mbases in size, and a Trans-ABySS version is available for transcriptome assembly. Other examples are EDENA, aTRAM, and EULER, which can perform de novo assembly of short NGS reads. Long reads from PacBio or Oxford Nanopore sequencing, which can have higher error rates than short Illumina reads, can be assembled using software such as CANU, FALCON, and HGAP, among others.
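As an illustration of what assembling reads "using only mathematical parameters" can mean, the following minimal Python sketch builds a de Bruijn graph from k-mers and walks it to rebuild a short sequence. It is a toy model under strong assumptions (error-free reads, a single strand, no repeated (k-1)-mers) and is not the algorithm implemented by any of the assemblers named above. The function names, reads, and value of k are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every substring of length k in a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_linear(graph):
    """Walk an unbranched graph from its single start node and return the contig.

    Assumption: the graph is one simple path (no repeats, no sequencing errors).
    """
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node in this toy example")
    node = starts[0]
    contig = node
    visited = {node}
    # Extend the contig while the current node has exactly one unvisited successor.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads that overlap along one short sequence.
reads = ["ATGGCGTG", "GCGTGCAA", "GTGCAATC"]
contig = assemble_linear(build_de_bruijn(reads, k=5))
print(contig)
```

Real assemblers add error correction, handling of both strands, and graph simplification; those steps are deliberately left out here.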
Although increasingly accurate and flexible, mathematical models have limitations that are inherent to biology itself. For example, mathematical models often ignore biological compartments such as the nucleus or mitochondria and instead assume that all components in a reaction are equally accessible [
11]. Sets of NGS reads show an analogous pattern, since the models occasionally underrepresent repetitive sequences that could be related to crucial regulatory roles. In this case, the assembly is statistically correct from a mathematical perspective but is incomplete and diverges from the expected behavior from a biological perspective [
12]. For this reason, it is suggested that the de novo assembly use the Arabidopsis thaliana genome as a reference (Figure 1). The biological approach provided by the Arabidopsis genome adjusts the mathematical model and brings the results closer to biological reality, even in some cases where there is a significant evolutionary distance between the researched plant and Arabidopsis [
13].
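The underrepresentation of repeats described above can be shown with a small, hypothetical Python example. Two invented sequences differ only in the copy number of a tandem repeat, yet they produce exactly the same set of k-mers, so a model that works only from distinct k-mers cannot tell them apart. The sequences, the repeat unit, and the value of k are assumptions chosen for illustration.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Invented sequences: same flanks, different copy number of the unit "TTAGG".
left_flank, unit, right_flank = "ACGT", "TTAGG", "CATG"
three_copies = left_flank + unit * 3 + right_flank
five_copies = left_flank + unit * 5 + right_flank

same_kmers = kmer_set(three_copies, 4) == kmer_set(five_copies, 4)
print(len(three_copies), len(five_copies), same_kmers)
```

The two sequences differ in length, but their 4-mer sets are identical, which is one simple way an assembly can be consistent with the data and still miss part of the repeat.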
Figure 1.
Recent research has shown that using the Arabidopsis genome as a reference improves the efficacy of de novo assembly compared to experiments that did not use a reference genome [
14]. This approach may be employed with a more focused strategy, such as an evolutionary adaptation, to improve plants with commercial value. Hypothetically, genome sequencing of cacao plants endemic to the Moskitia region (Cocoa Vavilov Center) in Honduras could allow for the discovery of genes that provide information to improve cocoa crops. Vavilov centers are composed of individuals that have the highest diversity and ancestry within a plant species, which makes them ideal candidates for discovering ancestral genes for plant improvement. These individuals' genomes can be explored to discover information useful to agronomy, biotechnology, and botany [
15]. Inbreeding, domestication, and conventional breeding have been observed to lead to the silencing or deletion of various genes that enable plants to resist attack by pests and diseases [
16].
The genomes of plants with higher levels of ancestry carry genes and metabolic pathways that may be related to their evolutionary origins because native plants are more closely related to the plants that gave rise to commercial crops [
17]. Ironically, while the genomes of some crops are available, curated, and annotated, it would be necessary to use the omics information of domesticated plants as a reference for assembling the genomes of these crops' ancestral plants. The main goal of plant breeding programs will soon be to produce plants that are resilient to a variety of environmental conditions and that provide food. The development of novel plant varieties will accelerate if high-quality genetic information from ancestor plants is available, because the selection and transfer of genes and metabolic pathways will be more effective [
18,
19].
Although they are comparable, the main difference between this method and reference genome assembly is the genome used as the reference. The reference genome assembly method makes use of a previously assembled genome as a scaffold [
14]. Ideally, this genome should come from the species under study or from one that is closely related to it. In this way, the model is adjusted using curated experimental data from the species that serves as a reference. In the second method, the reference genome comes from a different or evolutionarily distant species. In contrast to the earlier case, the role of the Arabidopsis genome is to provide a biological reference framework that aids in adjusting the mathematical model, not to serve as an assembly scaffold. Thus, the method is still essentially a de novo assembly [
20].
When working with a species of agricultural or forestry interest at the genomic level for the first time, or when exploring plant alternatives with the potential to become commercial species, this method can be applied [
13]. An example is the identification of a new tuber species with nutritional potential but low yield. The species being researched may lack an evolutionarily related organism with a genome that can serve as a reference. A de novo assembly is typically employed in this situation [
21]. By using the Arabidopsis genome as a reference, this assembly avoids losing some features of the genomic organization shared by many plant species, which would be lost if only a mathematical model were used.
Without such a reference, the data are analyzed from a strictly mathematical perspective, ignoring some biological peculiarities, to determine the probability that they are aligned and oriented in agreement with the theoretical model. In some instances, this implies that the assembled genome may have underrepresented genes because of the similarity of some conserved regions in distinct genes [
22]. A biological model that includes the possibility of these isoforms or genetic variations is employed to solve the problem. When there is no reference genome of the organism being investigated, using the genome of Arabidopsis may be an option because of the extensive knowledge we have about it and its importance as a model organism.
The information included in transcriptomes and metabolomes shows a similar pattern. Transcriptomes, for example, show the degree of gene expression over time and space. To make accurate comparisons, one would need a reference that was generated at a time and location comparable to those of the transcriptome under study [
23]. Therefore, to determine the level of Resistance Gene (R-Gene) expression, it is important to evaluate the gene sets of both the host and the pathogen at different time points and in different tissues to predict the variations in gene expression [
24,
25,
26,
27]. From a practical standpoint, it is not possible to have reference transcriptomes for every scenario. Nonetheless, some alternatives can be used to improve the algorithms that allow us to identify and measure sequences of interest. These options, however, can function more as an adaptation to the mathematical model than as a reference framework. There are already about 20,000 Arabidopsis RNA-seq libraries deposited in open databases (
http://ipf.sustech.edu.cn/pub/athrna/). These data provide an excellent platform for comparison in a variety of situations, including those affecting transcriptional regulation, tissue specificity, stress responses, and dynamics of gene development [
28].
Regardless of the type of information one is working with, supplying a reference is necessary when analyzing omics data. In some cases, such as human exomes, enough information is available to produce precise and biologically coherent assemblies, but in other cases, using a de novo assembly is the only option. In these cases, it will always be preferable to adjust the algorithms and mathematical models using data from a reference organism or an extensively studied species. The results of this de novo assembly approach are more accurate and biologically plausible than those obtained using mathematical models alone through the scoring system. Assembling genomic, transcriptomic, metabolomic, and other data from plants using the available and carefully selected information on Arabidopsis can help to perform this procedure with greater precision and a closer approximation to biological reality. This can lead to a deeper understanding of the organisms and an evaluation of their potential for genetic improvement or crop use.
2.2. Annotations of Crops Omics Information without Reference Genome
Once a genome has been assembled, it is usually essential to interpret the generated sequences. Sequence annotation is a procedure that often involves analyzing sequences throughout many databases in hopes of finding as much information as possible about these sequences [
29]. The selection of the database(s) that will be used for the annotation, as well as the software used to conduct the procedure, are common steps in the annotation process (Figure 2). Using specific databases, which reduces the errors that can arise when comparing a genome against all known sequences, is a widely used method to increase the precision of these annotations [
30]. In some cases, annotating against the entire universe of sequence data can lead to comparisons with sequences from distant species, which, regardless of their high statistical quality values, again move away from biological reality. The accuracy of the annotations and, thus, the quality of the process can be improved by using data from model organisms. Additionally, further plant databases can be used as references to improve the accuracy of the information delivered.
Figure 2.
For example, it is typical to search for specific molecular patterns conserved in a class of genes when looking for R-Genes in a plant species' genome. Some of these patterns, however, are present in other taxonomic groups and can be related to proteins that, in some cases, have distinct functions [
31]. For instance, NBS-LRR-like proteins have a significant role in immunity in both plants and widely diverse organisms like mammals. Consequently, it might be more useful to annotate the sequences using databases of plant models and reference organisms to identify the R-Genes more precisely in plants' WGS [
32]. By restricting the data used for comparison, search models such as Hidden Markov Models (HMMs) or Artificial Neural Networks (ANNs) also adjust to the biological reality of plants.
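The effect of restricting the comparison set can be sketched in a few lines of Python. The example below is only an illustration: the hit records, identifiers, taxa, and scores are invented, and the `best_hits` helper is hypothetical and does not correspond to any specific annotation tool. It keeps the highest-scoring hit per query, optionally after discarding hits from taxa outside a chosen set of plant references.

```python
def best_hits(hits, allowed_taxa=None):
    """Return the highest-bitscore hit per query.

    If allowed_taxa is given, hits from other taxa are ignored first.
    """
    best = {}
    for hit in hits:
        if allowed_taxa is not None and hit["taxon"] not in allowed_taxa:
            continue
        current = best.get(hit["query"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best


# Invented similarity-search results for one assembled contig.
hits = [
    {"query": "contig_001", "subject": "mammal_NLR_like",
     "taxon": "Mus musculus", "bitscore": 310.0},
    {"query": "contig_001", "subject": "plant_NBS_LRR",
     "taxon": "Arabidopsis thaliana", "bitscore": 295.0},
    {"query": "contig_001", "subject": "plant_kinase",
     "taxon": "Oryza sativa", "bitscore": 120.0},
]

# Hypothetical set of plant reference organisms used to restrict the search.
plant_references = {"Arabidopsis thaliana", "Oryza sativa"}

unrestricted = best_hits(hits)
restricted = best_hits(hits, allowed_taxa=plant_references)
print(unrestricted["contig_001"]["subject"], restricted["contig_001"]["subject"])
```

In this invented case, the unrestricted search assigns the contig to a mammalian protein with a slightly higher score, while the restricted search returns the plant NBS-LRR entry.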
The Arabidopsis thaliana genome assembly (TAIR 10.1) is available on the National Center for Biotechnology Information (NCBI) (
https://www.ncbi.nlm.nih.gov/genome/4), Ensembl Plants (
http://plants.ensembl.org/Arabidopsis_thaliana/Info/Index), and TAIR (
https://www.arabidopsis.org/index.jsp) websites, among other places, where tools for bioinformatics analysis can also be found. In addition to these databases, several commercially important plants have accessible reference genomes, including rice (
http://rice.uga.edu/), wheat (
https://www.wheatgenome.org/), maize (
https://www.maizegdb.org/), potato and tomato (
https://solgenomics.net), arabica coffee (
https://coffeegenome.ucdavis.edu/), sugar cane (
https://sugarcane-genome.cirad.fr/), banana (
https://banana-genome-hub.southgreen.fr/), and citrus (
https://www.citrusgenomedb.org/), among others. On the other hand, we may require more than one type of software, depending on the type of annotation we want.
Another example is the study of metabolic pathways that help break the dormancy of some seeds of economically useful plant species, such as Coyol (
Acrocomia aculeata). In this example, the main goal is to determine which metabolites in Coyol seeds help to break dormancy. For this purpose, transcriptomic and metabolomic data and their integration can be used, and the genes that are expressed at the time dormancy breaks can be determined by transcriptomics [
33]. This would allow researchers to find the proteins and enzymes needed to produce these metabolites. Also, it will be possible to find the concentration of these metabolites and their effects at distinct stages of seed germination by metabolomics [
34]. Like other palms, Coyol has a very low germination rate compared to the amount of fruit it produces. Nevertheless, it should be considered an important plant because of its cultural use and potential for biofuel production [
35]. Identification of metabolites that break seed dormancy could help the breeding of endangered species and domestication of wild species for cultural purposes, with increased long-term survival.
Applications in structural and functional biology can both benefit from annotations of genes and proteins. Annotations can help explain proteins and genes, their interactions, and their control by defining the molecular function, cellular location, or biological process in which they are involved. They even allow the calculation of evolutionary distances between proteins and genes in comparison to other members of the same plant species or in relation to a specific protein family. In general, annotations depend on the goal of the project and the type of data that can be accessed. Occasionally, the results will be predictions based on what is currently known about a particular protein type, gene, domain, etc., rather than annotations as such. The system is assumed to perform as predicted as long as there is no in vivo evidence to contradict what the simulations show. The predictions can be considered curated annotations after this information has been verified by experiments. Even so, the simulations contain a significant statistical component that underpins the results.
2.3. Using Data from Arabidopsis to Map Metabolic Pathways in Plants
A considerable amount of knowledge about metabolic pathway genes has been accumulated through biochemical and genetic approaches [
36]. This growing body of biological data facilitates the discovery of new metabolic pathways by using mathematical modeling approaches to select candidate genes involved in general and specific functions [
37]. Numerous biological metabolic networks in various organisms can be built and analyzed thanks to the development of bioinformatics tools and the accessibility of relevant information in databases [
38] (Figure 3).
The databases contain a wide range of species that can be investigated for data, including information at the exon, transcriptional, and gene levels. This data is input into further investigations, enabling the use of different bioinformatics tools for functional analysis such as modeling of signaling pathways [
39]. Tools for visualizing and analyzing metabolic pathways include databases, software, and software packages. These instruments can be used to determine the enzymes and metabolites engaged in a certain pathway, to forecast the impacts of genetic or environmental changes on pathway activity, and to produce hypotheses regarding the roles of as-yet-uncharacterized enzymes or metabolites [
40].
While functional analysis tools use a wide range of methodologies, they can be categorized into three main groups: over-representation analysis, functional class scoring, and pathway topology [
39]. The R package DOSE (
https://bioconductor.org/packages/release/bioc/html/DOSE.html), which is designed for DO-based semantic similarity measurement and enrichment analysis, is one example of a tool that fits within these categories. Pathview is a set of tools for pathway-based data integration and visualization (
https://bioconductor.org/packages/release/bioc/html/pathview.html). Likewise, the clusterProfiler package (
https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) provides methods for analyzing and displaying the functional profiles of genes and gene clusters. Some examples of web-based pathway tools include KEGG (
https://www.genome.jp/kegg/), MetaCyc (
https://metacyc.org/), and Reactome (
https://reactome.org/). These tools are widely used in the field of systems biology to study the complex interactions between genes, proteins, and metabolites that underlie cellular metabolism [
40]. A pipeline for investigating metabolic pathways can be built using these strategic techniques in conjunction, and it might include:
Collecting information on relevant metabolites, enzymes, and pathways from a variety of sources, including literature, experimental data, and pathway databases [38].
Using metabolic mapping tools to build a metabolic pathway map that includes all the metabolites and enzymes involved in the pathway [38]. This step involves obtaining and compiling data on biochemical reactions from current sources, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), to discover the functional annotation of genes [38, 41].
Making predictions regarding the roles of uncharacterized enzymes or metabolites while using pathway tools to examine the pathway map, identify important enzymes and metabolites, and predict the effects of genetic or environmental changes on pathway activity [41].
Applying functional targeted and untargeted metabolomics to understand how enzymes and pathways work and to find out which metabolites change in response to perturbations [40, 41].
Using the pathway information to develop new strategies against disorders that are associated with dysregulated metabolic pathways [42].
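Over-representation analysis, the first of the three categories of functional analysis mentioned above, can be sketched with a hypergeometric test. The Python example below is a minimal illustration and does not reproduce the implementation of clusterProfiler, DOSE, or any other package. The gene identifiers and the pathway membership are invented, and the `ora_pvalue` function is a hypothetical helper.

```python
from math import comb


def ora_pvalue(study_genes, pathway_genes, background_genes):
    """One-sided hypergeometric p-value for pathway over-representation.

    Probability of observing at least the seen overlap between the study
    set and the pathway when drawing the study set at random from the
    background.
    """
    background = set(background_genes)
    study = set(study_genes) & background
    pathway = set(pathway_genes) & background
    n_background = len(background)
    n_pathway = len(pathway)
    n_study = len(study)
    overlap = len(study & pathway)
    total = comb(n_background, n_study)
    upper = min(n_pathway, n_study)
    favourable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_study - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Invented example: 10 background genes, 5 of them annotated to one pathway.
background = [f"gene_{i}" for i in range(1, 11)]
pathway = [f"gene_{i}" for i in range(1, 6)]
study = ["gene_1", "gene_2", "gene_3", "gene_4", "gene_5"]

p_value = ora_pvalue(study, pathway, background)
print(p_value)
```

In practice, the test is repeated for every pathway and the p-values are corrected for multiple testing, a step that is omitted from this sketch.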
The advancement of computational methods and the availability of multi-omics data have made it possible to predict the metabolic pathways of important plant chemicals. This approach complements conventional genetic and/or biochemical approaches and helps to better understand the genes involved in the generation and modification of plant metabolites, which is important for increasing plant productivity and quality.
Figure 3.