3.1. Data types, Repositories, and Knowledge Bases
Plant phenotyping is key to plant breeding, the characterization of biodiversity, and genetic and genomic approaches to translational research (55). Classical genetic and functional genomics studies in model and crop plants have identified numerous mutants that show distinct morphological and anatomical phenotypes and have associated the individual mutant phenotypes with one or more genes, pathways, and molecular processes.
Table 3 lists databases that host mutant collections together with descriptions of the phenotypes of individual mutants and their associated genes, including MaizeDIG (42), RIKEN Arabidopsis Genome Encyclopedia (56), Mutant Variety Database (57), Plant Genome Editing Database (58), the tomato mutant archive TOMATOMA (59), and Plant Editosome Database (60).
In addition, complex phenotypic traits (i.e., morphological and physiological traits) related to the fitness and performance of an organism are often quantitative in nature and have multiple genetic determinants (61,62). Examples of traits that are controlled by multiple loci (known as quantitative trait loci, QTLs) are crop yield, biomass, resistance to pests and pathogens, abiotic stress tolerance, nutritional value, and ease of harvest. In addition to crop breeding, trait-based approaches are widespread in ecological research (63), as they provide a general understanding of a wide range of ecological and evolutionary phenomena, such as the impact of climate change and anthropogenic land use on biodiversity (64-66). In Table 3, we provide a list of key databases (or portals of larger databases) that host information related to traits, QTLs, and associated data, including the Gramene QTL database (67), the QTL database for wheat (68), GLOPNET (69), TRY (70), the Ecological Flora of Britain and Ireland database (71), BIOPOP (72), GRIN (73), the USDA PLANTS Database, BiolFlor (74), LEDA Traitbase (75), the BROT database of plant traits for Mediterranean basin species (76), and AusTraits (77). Trait and QTL data are also integrated with other types of data in the various crop community databases listed in Table 2.
Phenomics is the systematic analysis of phenotypes on a genome-wide scale for their refinement and characterization. With the advent of high-throughput platforms, it has become possible to collect phenomics data at single-cell, organismal, and/or population-wide scales (78). Phenomics can be used for species recognition and biodiversity characterization (79), for stress quantification (79-81), and for crop yield prediction (82,83). As a result, phenomics data sets are very large and come in different formats (e.g., JSON files). Databases that host phenomics data include GnpIS (84,85), PGP (86), CartograPlant (87), Ag Data Commons (https://data.nal.usda.gov/) (88), PathoPlant (89,90), PncStress (91), and OSRGD (92).
Although the phenome is analogous to the genome, it is not possible to fully characterize a phenome because phenotypic data are heterogeneous and multifaceted, with added layers of complexity at the cell, tissue, and whole-plant levels that vary further with developmental stage and growth environment (78,93). Thus, phenomics approaches may focus on specific aspects of phenotypic data. For example, an intensive phenomics study may focus on high-throughput digital imaging of different tissues of an organism across developmental stages or growth environments and may include quantitative data on plant height, biomass, flowering time, yield, and photosynthetic efficiency. Another study may employ orthomosaic images or time-series RGB images and remote sensing to monitor algal blooms in the ocean (94). Because phenomics data can be extremely variable in nature, the necessary metadata include information about plant species, tissue, developmental stage, environmental conditions, experimental design, data collection, processing, and analysis.
In addition to traditional phenotypes, molecular phenotypes include changes in chromatin organization, transcripts, proteins, metabolites, and ions (95-97). Quantitative changes in gene expression and in protein and metabolite profiles in plants have far-reaching consequences for (i) the nutritional value of cereals, legumes, fruits, and vegetables; (ii) the quality of bioproducts such as wine, beverages, vinegar, oil, and fuel; (iii) the ability of plants to adapt to various abiotic stress conditions; and (iv) their innate ability to defend against pests, pathogens, and herbivores (98-102).
Proteome and metabolome datasets allow a deeper understanding of an organism’s metabolic processes at the level of organ, tissue, and cell, as well as of how these processes change in response to intrinsic developmental programs and environmental factors. Proteome datasets further confirm the subcellular localization of proteins, their comparative abundance between different tissues and cells, protein–protein interactions, and post-translational modifications (103). Once the original proteomic datasets and associated metadata/manuscript have been submitted to public data repositories such as PRIDE (103-105), MassIVE (https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp), jPOST (106,107), iProX (108,109), Panorama (110), and PeptideAtlas (111,112), they are made available for re-analysis and further exploration by other researchers.

Metabolomics provides a comprehensive overview of the metabolite profile of an organism, tissue, cell, or subcellular component at a specific time point and is used to identify nutritional, medicinal, flavor, and disease resistance compounds as well as chemical interactions between plants and other biological systems. A recent comprehensive review of the methodologies used to explore the highly complex and diverse metabolites of plants can be found in Tsugawa et al., 2021 (113). The types of data collected for metabolomics depend on the method of chemical fingerprinting. As an example, in mass spectrometry (MS), a typical dataset consists of a matrix containing information on the retention time (RT) and retention index, the mass-to-charge ratio (m/z), and peak characteristics such as number and width. These data go through pre-processing, which converts raw instrument data into organized formats using background subtraction, noise reduction, curve resolution, peak picking, peak thresholding, and spectral deconvolution. There are various software tools for analyzing metabolite data, each of which may be specific to a particular detection method or instrument. Popular tools include MZmine, XCMS, MS-DIAL, metaMS, Progenesis QI, and MetAlign. For the annotation of unknown metabolites, popular software tools include MS-FINDER, MetDNA, MetFamily, and GNPS, among others. Raw file formats generated by the instruments include d, raw, idb, cdf, wiff, scan, dat, cmp, cdf.cmp, lcd, abf, jpf, xps, and mgf. Derived file formats include mzml, nmrml, mzxml, xml, mzdata, cef, cnx, peakml, xy, smp, and scan.

Because of the complexity of metabolomic data, several standardization initiatives have been undertaken. The Chemical Analysis Working Group started the Metabolomics Standards Initiative (MSI) to develop metabolomic standards (114,115), with revisions suggested by (116). The community-driven Metabolomics Society has a Data Standards Task Group focusing on metabolomics data standardization and sharing. This was followed by the Coordination of Standards in Metabolomics (COSMOS) (117) and MetaboLights (118), to develop tools that ease the submission of metabolomic data (119).

ProteomeCentral and Omics DI serve as central repositories for these omics datasets, which are then re-used in protein knowledge bases (UniProt and neXtProt), genome browsers (Ensembl and UCSC), proteomics resources, and other bioinformatics resources (e.g., OpenProt and LNCipedia). The ProteomeXchange (PX) datasets are re-analyzed by different proteomics resources of the PX consortium, making the data more reliable. The Paired Omics Data Platform (PoDP) (120) links the metabolomics data submitted to MassIVE or MetaboLights to genomes stored in NCBI or JGI. In Table 3, we list the two major repositories available for submission of raw and processed metabolome data: the NIH Common Fund's National Metabolomics Data Repository (NMDR) portal, the Metabolomics Workbench, and MetaboLights.
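To make the MS pre-processing steps mentioned above more concrete, the following minimal Python sketch applies background subtraction, peak thresholding, and peak picking to a single intensity trace sampled over retention time. It is an illustration under simplifying assumptions, not the algorithm used by MZmine, XCMS, or any other tool named here: the median is assumed to be an adequate background estimate, a peak is assumed to be a simple local maximum, and the function name `pick_peaks` and the example values are hypothetical.

```python
from statistics import median


def pick_peaks(intensities, rts, threshold):
    """Return (retention_time, corrected_intensity) pairs for detected peaks.

    Assumptions (illustrative only): the median intensity approximates the
    background, and a peak is a local maximum at or above `threshold`.
    """
    # Background subtraction: remove a crude constant baseline.
    baseline = median(intensities)
    corrected = [max(0.0, y - baseline) for y in intensities]

    peaks = []
    # Peak picking: scan interior points for local maxima.
    for i in range(1, len(corrected) - 1):
        y = corrected[i]
        is_local_max = y > corrected[i - 1] and y >= corrected[i + 1]
        # Peak thresholding: discard maxima below the intensity cutoff.
        if is_local_max and y >= threshold:
            peaks.append((rts[i], y))
    return peaks


# Hypothetical trace: 13 scans, 0.1 min apart, with two peaks on a flat background.
intensities = [10, 11, 10, 50, 120, 60, 10, 12, 10, 40, 90, 30, 10]
rts = [i / 10 for i in range(len(intensities))]

for rt, height in pick_peaks(intensities, rts, threshold=20):
    print(f"peak at RT {rt:.1f} min, height {height}")
```

In practice, each detected peak would also carry its m/z value and width, producing the feature matrix described above; dedicated tools additionally perform noise reduction, curve resolution, and spectral deconvolution, which this sketch omits.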
Gene expression and metabolic phenotypes often culminate in visible phenotypes, which can be described using Plant Ontology terms (121-123). More recently, Plant Ontology terms have been extended to large-scale phenomics data from a single species (124) to support comparative phenomics in plants (125) and to describe trait phenotypes expressed at a specific developmental stage or under a specific environment or stress (126). To bridge the genotype-phenotype gap, we need to integrate multiple types of data, including genotypic, large-scale phenome, gene expression, proteome, and metabolome data, described using defined and standardized ontologies.
After phenotypic and phenomics data have been collected or generated, it is recommended that they be formatted according to community guidelines and submitted to primary data repositories, along with well-described metadata. The primary repositories serve as a source of primary or raw data (with base annotations) for secondary databases, which visualize them in genome browsers (127) or synthesize new information by integrating them with other data types such as plant metabolic networks (128,129), system-level plant pathways (130-132), Expression Atlas, metabolic models, etc. These secondary knowledge bases are of primary importance to plant researchers for formulating data-driven hypotheses for experimental and translational research, for analyzing high-throughput omics data in the overall context of a species' genome and systems-level pathway networks (133), and for gaining evolutionary insights through intraspecies and interspecies comparisons. The implementation of standards and the development of public repository infrastructure are crucial for FAIR phenotypic data, although many public repositories do not currently support the submission of phenotype data (see Table 3).
3.2. Phenotype data formats, standards and metadata
The structure and characteristics of data types, along with any additional metadata, are crucial for enabling future data re-use and re-analysis by other researchers. The most relevant metadata shared across the various data types (generated by a diverse set of methods and platforms) include the taxonomic identification of the plant, the individual or cultivar name or accession ID, geo-references or growth conditions, field sampling or experimental design, cell, tissue, or organ information (e.g., whole plant, leaf, root, flower, shoot, single cell, etc.), plant maturity and health status, measurement date (season, time of day), and the type of phenotype measured (quantitative or qualitative) (70,134). These metadata can be entered in simple text format during the submission of raw data to any primary repository and are easily exported from one database to another as TXT files.
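As an illustration of how such shared metadata could be held as plain text, the sketch below writes a metadata record to a tab-delimited TXT representation, reads it back, and reports empty fields. The field names in `FIELDS` are hypothetical labels derived from the categories listed above, not a schema prescribed by any repository, and the example record is invented for demonstration.

```python
import csv
import io

# Hypothetical column names, one per metadata category named in the text.
FIELDS = [
    "taxon",
    "accession_id",
    "location_or_growth_conditions",
    "experimental_design",
    "plant_part",
    "maturity_and_health",
    "measurement_date",
    "phenotype_type",
]


def to_tsv(records):
    """Serialize metadata records (dicts) to tab-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-delimited text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


def missing_fields(record):
    """List the metadata fields that are absent or empty in a record."""
    return [name for name in FIELDS if not record.get(name)]


# Invented example record; all values are placeholders.
record = {
    "taxon": "Oryza sativa",
    "accession_id": "ACC-0001",
    "location_or_growth_conditions": "greenhouse",
    "experimental_design": "randomized block",
    "plant_part": "leaf",
    "maturity_and_health": "flowering; healthy",
    "measurement_date": "2020-06-01",
    "phenotype_type": "quantitative",
}

text = to_tsv([record])
print(text)
print(from_tsv(text) == [record])
```

A check such as `missing_fields` can be run before submission so that incomplete records are caught while the person who collected the data can still supply the missing values.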
Furthermore, plant phenotypes/traits can be classified as categorical (qualitative and ordinal) or quantitative (continuous) traits (135). Some phenotypes are rather stable within species (mostly categorical traits), and some of these can be systematically compiled from species checklists and floras (e.g., 136). Not all phenotypes, however, can be mapped from one species to another. It is also important to note that a phenotype is often the cumulative outcome of the genotype, the environment, and their interaction. Many important agronomic traits, such as seed or fruit quality, yield, abiotic stress tolerance, and pathogen resistance, have a quantitative genetic architecture involving minor and major genes or QTLs. Thus, the research question and the method are important for setting the scope and goals of a study, and they require specific metadata and standards. For instance, most traits relevant to ecology and earth system sciences are characterized by intraspecific variability and trait–environment relationships (mostly quantitative traits). These traits have to be measured on individual plants in their particular environmental context. Each such trait measurement has high information content, as it captures the specific response of a given genome to the prevailing environmental conditions (70). Thus, the collection of these quantitative traits and their essential environmental covariates is of vital importance. While trait measurements themselves may be relatively simple, selecting an adequate entity (e.g., a representative plant in a community, or a representative leaf on a tree) and obtaining the relevant ancillary data (taxonomic identification, soil and climate properties, disturbance history, etc.) may require sophisticated instruments and a high degree of expertise and experience. Moreover, these data are most often individual measurements with a low degree of automation. This not only limits the number of measurements but also carries a high risk of errors, which need to be corrected a posteriori and require substantial human work. Hence, integrating these data from different sources into a consistent data set requires a carefully designed workflow with sufficient data quality assurance. These measurements of quantitative traits are single sampling events for particular individuals at certain locations and times, which preserve relevant information on intraspecific variation and provide the necessary detail to address questions at the level of populations or communities (134). Hence, accurate and careful collection of the data and their associated metadata and ancillary data is key to correctly preserving this valuable information, as well as to performing suitable data integration across studies, species, and data types.
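One small piece of the quality-assurance workflow discussed above can be sketched in Python: flagging implausible quantitative values, unknown categorical levels, and missing environmental covariates before records from different sources are merged. Everything named here is an assumption made for illustration, including the trait names, the plausible range, the category levels, and the list of required covariates; a real workflow would take these from the relevant community standard or database.

```python
def check_measurement(measurement, plausible_ranges, category_levels, required_covariates):
    """Return a list of quality issues found in one trait measurement."""
    issues = []
    trait = measurement.get("trait")
    value = measurement.get("value")

    if trait in plausible_ranges:
        # Quantitative (continuous) trait: value must be numeric and in range.
        low, high = plausible_ranges[trait]
        if not isinstance(value, (int, float)) or not low <= value <= high:
            issues.append("value outside plausible range")
    elif trait in category_levels:
        # Categorical trait: value must be one of the known levels.
        if value not in category_levels[trait]:
            issues.append("unknown category")
    else:
        issues.append("unknown trait")

    # Ancillary data: every required covariate must be present and non-empty.
    for covariate in required_covariates:
        if measurement.get(covariate) in (None, ""):
            issues.append(f"missing covariate: {covariate}")
    return issues


# Hypothetical reference settings and records, for illustration only.
plausible_ranges = {"plant_height_cm": (0.0, 500.0)}
category_levels = {"growth_form": {"herb", "shrub", "tree"}}
required_covariates = ["site", "date"]

records = [
    {"trait": "plant_height_cm", "value": 82.5, "site": "plot A", "date": "2020-06-01"},
    {"trait": "plant_height_cm", "value": 8250, "site": "plot A", "date": ""},
    {"trait": "growth_form", "value": "vine", "site": "plot B", "date": "2020-06-02"},
]

for record in records:
    print(record["trait"], record["value"], check_measurement(
        record, plausible_ranges, category_levels, required_covariates))
```

Flagging rather than silently correcting keeps the a posteriori corrections traceable, which matters when the original measurement cannot be repeated.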