3.1. Morphology
Morphological data can often be quite difficult to score objectively for use in shallow phylogenetic studies, given that differences between very closely related taxa can be hard to quantify. Thus, they have tended to fall out of favor in such studies as compared to DNA sequences, and at most are mapped onto the tree built from molecular data. However, in deep phylogenetic studies some morphological characters are easy to score objectively, if they are highly conserved. For example stomates are quite easy to compare among the major groups of land plants [
56]. While we do not have good evolutionary models for such complex characters, and they are relatively few in number as compared to nucleotide characters, which could both be considered drawbacks, morphological characters have strengths that more than compensate.
Morphological characters have more states (e.g., there are many more ways to modify stomates than there are possible point mutations at a given site) -- that alone reduces the possibility of LBA since the more possible states, the less likely for a homoplastic match to occur at random [
57]. Homology can be easier to hypothesize in morphology, given its 3-dimensional positional information plus ontogeny, as compared to the one-dimensional positional information present for use in DNA sequence alignments. Furthermore, morphological characters tend to show episodic evolution, i.e., major changes followed by periods of stasis. A clock-like marker is ideal for a shallow phylogenetic study, but the ideal character for a deep phylogenetic study is a broken clock, which quit ticking at some point in the past and retains its state until the present [
3]. Finally, having morphological characters in the matrix can allow the addition of fossil taxa to the matrix, if those characters are observable in them, which is a potential way to break up long branches and reduce LBA.
Some have objected to morphological characters because they are more likely to be subject to selection than DNA sequence data, but this is an unfair and illogical comparison -- certainly genes that are conserved enough to be useful in deep phylogenetic studies must be under tremendous selection pressure as well. Having functional ribosomes and a working RuBisCO enzyme is certainly more important to fitness of a plant than a certain shape of stomata. Convergence at the molecular level just needs to be considered as seriously as with morphology. Conflicting topologies among different datasets is not only due to LBA, lineage sorting, or horizontal transfer (the three processes most often invoked), it can just as well be due to selectively-driven convergence (a process that is completely ignored by current analytical methods). Our ability to find the true history in the face of all these confounding processes depends on the inclusion of as many distinctly different data types as possible. Each type may have its issues, but at least if the issues with one data type are independent of the issues of other data types, there is hope of finding a common historical pattern shared by all. When morphology is left out, an important category of distinct data is left out.
3.2. Nucleotide Sequences
Nucleotide sequences represent the most abundant characters for phylogenetic reconstruction. They include coding and non-coding regions that make up the genome and are directly inherited (unlike phenotypic features). For protein coding genes, their amino acid sequences are often used in analysis to minimize the effect of long branch attraction caused by the limited number of character states in nucleotide sequences. Both DNA and protein sequences have the advantage of clearly defined states. Well understood chemistry of the four nucleotides and 20 amino acids also allows development of evolutionary models relatively easily. In recent years, automated sequencing has made acquisition of this colossal body of potential phylogenetic information within the reach of systematists.
The massive number of sequence characters, however, poses difficulties for sorting out phylogenetic signals from noise generated by LBA, convergence, lineage sorting, and reticulate evolution, as a genome has been shaped by many forces over evolutionary time. One manifestation of this complexity is evolutionary rate heterogeneity in different genes and at different positions in DNA and protein sequences [
58,
59] as well as across lineages of organisms [
60]. Thus, although evolution at DNA and protein levels may seem to be easier to model than at other levels, it is no simple task to develop models to fit real data. Specifically, several factors may contribute to the data-model fit challenge. First, a vast majority of nuclear genes evolve much faster than chloroplast and mitochondrial genes used in earlier years of molecular systematics. This rate increase complicates an already difficult rate heterogeneity problem. Second, most phylogenomic data sets are much larger than multigene data sets in character number but not in taxon number, causing under-sampling of intermediate states for faster evolving characters. Third, because most sequenced genomes are from model organisms or economically important species, taxon sampling in phylogenomic studies tends to skew densely toward certain groups yet sparsely in phylogenetically critically positioned taxa, resulting in highly uneven representation of diversity in the study group. As a result, many minor insignificant but system-wide factors that generate only negligible random errors in single to multigene matrices can exert their cumulative influence to produce systematic errors leading to statistically well supported but incorrect results in phylogenomic studies [
61].
Recent studies have shown that algorithms employing models that incorporate higher- level genome and proteome organizational information, such as protein secondary structures [
62,
63] and amino acid solvent accessibility [
64], tend to outperform those that use simple models in dealing with systematic errors in phylogenomic data. One study of animal phylogeny found that by using a site-heterogeneous infinite mixture model CAT-GTR (which is able to adapt to the complexity actually present in the data) [
63], and recoding conventional amino acid sequences into partitioned phylogenomic data by masking biochemically similar and/or highly exchangeable amino acids, analyses of two data sets that contain 89 genes from 62 taxa and 117 genes from 76 taxa both placed sponges as the sister lineage to all other animals, as most previous studies had shown [
65]. The competing hypothesis advocated by two more simplistic phylogenomic studies, which placed comb jellies as the sister lineage of all other animals [
66,
67], turned out to be a result of LBA. In another study [
68], which analyzed a chloroplast data set of 61 proteins from 24 land plants [
69], both
Amborella and
Nymphaea were identified as members of a clade sister to all other angiosperms when three models that considered distinct structural and functional constraints of protein evolution were used (JTT+C20+F+gamma, JTT+PMSF2+gamma, and CAT+GTR+gamma). In contrast,
Amborella alone was found to occupy such a position when a conventional model (JTT+F+gamma) was used [
68]. The former result is consistent with the consensus that has emerged from analyses of slow-evolving mitochondrial genes [
70] and nuclear phylogenomic data with a coalescent model [
71], whereas the latter likely represents an artifact generated by LBA. Both of these cases involve lineages stemming from ancient radiations, and demonstrate that when analyzing large phylogenomic matrices to unravel phylogenetic patterns at highly compressed deep nodes, models incorporating higher-level structural features of protein sequences may offer better resolution than conventional DNA sequences.
Another manifestation of complexity of the nuclear genome is that a vast majority of nuclear genes exist in families of varying sizes, which arose from repeated genome duplications throughout the history of eukaryotes [
72]. These genes have duplicated copies that coalesce back in time at different points. Functional divergence and selective retention/loss of the copies after duplication [
73] produce genes that fail to meet orthology and evolutionary rate constancy criteria of ideal phylogenetic markers. Nevertheless, theoreticians have developed coalescence conceptual frameworks that allow the process of gene duplication and species/lineages cladogenesis to be modeled and reconciled [
74,
75,
76,
77]. Consequently, gene copies that are not strictly orthologous may provide information for species tree reconstruction. As mentioned above, use of a coalescent method in a phylogenomic analysis uncovered the presumably correct position of
Amborella in the angiosperm phylogeny [
71]. However, it has been found that a protocol of “statistical binning”, which seeks to overcome gene tree estimation error by concatenating loci of different coalescent histories into longer multi-locus supergenes, is operating under an assumption/model that is often violated, as >92% of supergenes comprise discordant loci [
78]. In an ideal situation, the supergene tree set should be an accurate estimate of the true underlying gene tree distribution. In this regard, it is not certain whether the bryophyte monophyly topology obtained by two phylogenomic studies of land plant phylogeny in coalescent model-based analyses [
44,
46] was caused by this model violation. More theoretical clarification on data matrix parameters is needed before any firm conclusions can be drawn on conditions under which coalescent models can or cannot perform optimally to reconcile the discrepancy between gene trees and species trees. What should be added here is that organellar genes generally do not have this homology-orthology confounding problem, as they are all single-copy genes that date back to at least the eubacterial ancestors of both chloroplasts and mitochondria.
Model violation has been reported to be rather prevalent in phylogenetic studies [
78,
79]. This phenomenon may be caused by more higher-level structural organization features of the genome/proteome than codons and exons/protein secondary structures, e.g., protein tertiary structures and interaction networks [
80,
81], or paralogy-orthology [
82] mixing of many genes in the nuclear genome. Further challenges lie ahead to develop more realistic models that can handle large complex phylogenomic data sets as more taxa and genes become available for analysis.
In light of the crucial role played by models in analyses of large phylogenomic matrices, it may be helpful to examine how model choices were made in several land plant phylogenomic studies reviewed earlier, so as to understand why they almost all produced the bryophyte monophyly topology. Three studies used the CAT-GTR model, but because it was computationally costly [
68,
83], adjustment was made to cut taxon and/or character numbers [
44,
46,
48]. This compromise between model and data effectively reduced the matrix size considerably and might have undermined the merit of genome-wide extensive character sampling of phylogenomic studies. Further, no protein secondary structural constraints were introduced together with the model in their analyses. Four other studies used less sophisticated models, GTR+gamma for nucleotide data [
49] and JTT, JTTF, and JTTDCMUT [
49], LG+C60+G+F [
50,
52], and LG+C20+F+R5 [
51] for amino acid sequences, presumably because they represented computationally implementable choices. Only one study employed the CAT-GTR model in a Bayesian analysis of a large amino acid sequence matrix of 160 genes and 177 taxa, but again without adding protein secondary structural constraints to the model [
52]. These model choices and implementation modification raise a question on accuracy of the phylogenetic relationships reconstructed in these studies.
As to how the orthology assumption was followed in these land plant phylogenomic studies, it is worth noting that the data analyzed by two original and several follow-up studies were largely from organ-specific transcriptomes [
44,
46,
49,
50,
51,
52]. It is questionable whether sequences of many so-called “single-copy” genes are truly orthologous across such a broad time span, because during colonization of the land and subsequent diversification into both terrestrial and aquatic niches plants went through many rounds of whole genome duplication followed by loss of some copies and retention of tissue-specific copies [
73,
84]. This gene sampling strategy again casts doubt on validity of the conclusions reached from these studies.
From the above discussion, it appears that although phylogenomics has the potential to locate all historical information present in genomes, there is also a huge amount of confounding variation to be sorted out, especially for these deep reconstruction issues. It will be some time before its power can be fully realized to resolve many deep phylogenetic nodes, as the current models and computational resources still cannot match the complexity and ever-increasing sizes of phylogenomic data sets. Therefore, it is time to re-think the practice of uncritically constructing a mammoth-sized super-matrix and relying on supercomputers to deliver “correct” results. The time-tested strategy of constructing a smaller but better-curated data set, by carefully and rigorously selecting orthologous characters/genes as well as taxa that are suitable to the questions attempted, still should be a critical part of a good study design.
Several specific recommendations may be worth consideration. First, for all sequenced genomes, only those that represent diversity as evenly as possible spanning the study group need to be included. Extra taxon sampling should be concentrated on lineages that (based on previous work) are thought to branch relatively closely to the deep nodes of interest in a particular study; adding many lineages nested in recent clades will not particularly help. Second, among all single-copy genes, hopefully with only orthologous sequences, that can be collected from sequenced genomes, different rate categories can be established even at the alignment stage. Slow-evolving genes can form a core data set, and fast-evolving genes should be either removed or at least analyzed separately so that they do not unnecessarily burden the analysis with LBA. Since chloroplast and mitochondrial genes are generally slow-evolving and have been used extensively in the early stage of plant molecular systematics, it is always advisable to include them in analyses (no matter how many nuclear genes are included) so that comparative analyses can be conducted to monitor performance of the genes from three cellular genome compartments. Third, it may be a good idea to divide the genes according to their functions, preferably after the rate categories are established, as this kind of data partition can help to identify function-specific selective forces that might have distorted phylogenetic signals. For example, in the study of early land plant phylogeny, it is possible that genes related to gametophyte development might have undergone convergent evolution due to the similar life cycle of the three bryophyte lineages and thus produced the bryophyte monophyly topology. Rate comparison of these genes and those that support the bryophyte-paraphyly topology might identify the cause of phylogenetic incongruence. Fourth, different models should be applied to genes of different rate categories to identify the best data-model fit sets for analyses, even though some may perform better than others in some aspect such as running time.
Recently, an investigation of the thorny rooting issue of the eukaryote phylogeny showed that when eight different models were applied in analyses of a matrix of 183 eukaryotic proteins of archaeal ancestry from 185 taxa, all analyses were all able to obtain the presumably correct result, placing four excavate taxa, Parabasalia, Fornicata, Preaxostyla, and Discoba as serial sister lineages to other eukaryotes. Nevertheless, one analysis with a protein structure partition model, which recognized six categories of buried and exposed helices, sheets and loops of secondary structures, took only a fraction of run time used by analyses with seven other common models [
83]. Obtaining the same result using different models also satisfies one of the gold criteria of scientific studies, reproducibility [
85], hence increasing the likelihood that the result represents reconstruction of a historical divergence event. On the other hand, if some models are found to fit certain matrices better than others, and when different results are produced, the rate and functional characteristics of the matrices should offer explanations on which topology represents the historical diversification pattern and which topologies are analytical artifacts. Explanation of homoplasy is one of the fundamental goals of systematics and deepens understanding of evolution. In this regard, it is worth noting that several land plant phylogenomic studies obtained multiple conflicting topologies of bryophyte relationships and could not provide clear evidence to support a firm conclusion because of lack of diagnostic information on the matrices as all data were amalgamated into one super-matrix [
44,
46,
51].
3.3. Genome Structural Characters
Genome structural characters refer to non-point mutations in genomes, such as gene order or syntenic block changes, insertions/deletions in genes or on chromosomes, gene/genome duplication events, intron gains/losses, and
cis- to
trans-splicing changes of group I or II introns. In the early years of molecular systematics, thanks to their rarity these mutations were used as special characters to resolve phylogenetic issues using the maximum parsimony principle [
20,
21,
22,
86]. As more genomes were sequenced, such characters became more abundant and thus could form a separate matrix for reconstructing phylogeny of a group in parallel to sequence characters [
23,
25,
36,
39]. Mitochondrial and chloroplast genomes, being fundamentally prokaryotic in nature with extremely low levels of recombination in most eukaryotes, have been the major sources of such characters for both plants [
23,
24,
36,
39] and animals [
86,
87]. The nuclear genome, being much larger and also having more rearrangements of gene order than its organellar counterparts, has not yet been broadly amenable to such type of analysis. However, recent development in bioinformatics has revealed a large number of microsyntenic characters in angiosperms through genome-wide screening of gene order at a fine scale [
88]. Further, inference of the ancestral timing of genome duplication by examining distribution patterns of duplicated genes among sequenced nuclear genomes has made it possible to determine relationships among early land plant lineages without having to use an outgroup [
52]. Finally, the Archaea sister group of Eukaryota has also been resolved using shared presence of several eukaryote signature protein genes (actin and tubulin, archaeal cell division proteins related to the eukaryotic Endosomal Sorting Complexes Required for Transport (ESCRT)-III complex, and several information-processing proteins involved in transcription and translation) [
89]. Hence, the nuclear genome is likely to become a rich source of such structural characters for resolving difficult deep issues in the tree of life as more species are fully sequenced.
One major strength of genome structural characters is their conservative nature of evolution, which would tend to result in low levels of homoplasy in the sort of deep analyses being discussed in this review. This merit is partly derived from a pre-screening process during the matrix assembly, where fast-evolving changes in genomes are either excluded or simply unrecognized when their frequency and extent of changes are high and substantial. A second strength of these characters is their complexity – there are diverse mechanisms for genome structural changes from recombination to intron-splicing – yet having a simple mode of transformation, having merely two states – presence or absence. Some complex changes in genome structure can be broken into multiple characters, with coding made simpler. Thus, the clarity in character and character state definition inherently leads to low homoplasy levels in the matrices of this type of characters, as long as their rate of change is slow. They are also less prone to being affected by LBA in comparison to nucleotide and amino acid sequences, which often suffer from this problem due to compositional bias in ancient lineages involved in water-land environment transition [
32,
37,
38,
39,
41]. Third, genome structural characters are usually independent of each other, represent a different kind of genetic variation than point mutations, and are sampled across the entire genome, presumably randomly. Thus, they satisfy the independence criterion of ideal phylogenetic characters [
90,
91] better than sequence characters from a gene or a gene-network. Taken together, these features of genome structural characters make them a class of unique and distinct characters that can contribute to reconstructing organismal phylogenies together with DNA sequences and morphological characters.
The major weakness of genome structural characters, that they are few in numbers, comes from one of their strengths, their rarity. Many characters of this category go undetected until variants from newly sequenced genes or genomes are exposed and their phylogenetic informativeness is manifested. A certain amount of prior knowledge on gene and genome evolution is required before the matrix assembly. To exploit the large numbers of cryptic microsyntenic characters in the nuclear genome, as shown in the recent study of angiosperm phylogeny [
88], sophisticated niche bioinformatic tools are a pre-requisite. Hence, genome structural characters, despite having a unique role to play in phylogenetic reconstruction, currently remain as a secondary source of information to the massive number of sequence characters, but hopefully their use will increase in importance in the future.
In the study of phylogeny of early land plants, genome structural characters have played a distinctive and prominent role in resolving some highly contentious issues. As reviewed earlier, such characters from both mitochondrial and chloroplast genomes were discovered and when analyzed supported the paraphyly of bryophytes, with liverworts being sister to all other land plants (
Figure 1A &
Figure 2Q) [
22,
23,
24,
25,
26]. Most recently, two analyses of a data set of 24 high-quality embryophyte nuclear genomes were conducted to infer root placement on the land plant phylogeny without including an algal outgroup, one using the method ALE (amalgamated likelihood estimation) [
53] and the other with STRIDE (species tree root inference from gene duplication events) [
54]. The former identified three possible roots of the land plant phylogeny: hornworts (
Figure 1M), mosses (
Figure 1F), and all bryophytes (
Figure 1N), and the latter assigned 0.2%, 39.9% and 59.8% probability to hornworts (
Figure 1M), all bryophytes (
Figure 1N), and liverworts (
Figure 1A) separately as the root [
52]. Despite some ambiguity, it seemed that there was signal in these nuclear gene duplication events that could shed light on the relationships of early land plants. Moreover, the microsyntenic characters amassed in the nuclear genome of angiosperms recently [
88] should be able to be extended to all streptophytes, as all key lineages of green algae and land plants have had representative genomes sequenced. Ultimately, genome structural characters, given their unique characteristics, should be able to provide a distinct perspective, complementary to that of sequence characters and morphology, on diversification patterns of early land plants, as the five long branches separating green algae, each of the three bryophyte lineages, and vascular plants will continue to haunt phylogenetic algorithms for analyzing sequence data [
92].