Preprint
Article

The Structure of Evolutionary Model Space for Proteins Across the Tree of Life

This version is not peer-reviewed.

Submitted: 20 December 2022

Posted: 21 December 2022


A peer-reviewed article of this preprint also exists.

Abstract
The factors that determine the relative rates of amino acid substitution during protein evolution are complex and are known to vary among taxa. We estimated relative exchangeabilities for pairs of amino acids from clades spread across the tree of life and assessed the historical signal in the distances among these clade-specific models. We trained the models separately on collections of arbitrarily selected protein alignments and on ribosomal protein alignments. In both cases, we found a clear separation between the models trained on multiple sequence alignments from bacterial clades and the models trained on archaeal and eukaryotic data. We assessed the predictive power of our novel clade-specific models of sequence evolution by asking whether fit to the models could be used to identify the source of multiple sequence alignments. Model fit was generally able to classify protein alignments correctly at the level of domain (bacterial versus archaeal), but classification accuracy at finer scales was much lower. The only exceptions were two archaeal lineages, Halobacteriaceae and Thermoprotei, for which classification accuracy was relatively high. Genomic GC content had a modest impact on relative exchangeabilities despite having a large impact on amino acid frequencies. Relative exchangeabilities involving aromatic residues exhibited the largest differences among models. A small number of exchangeabilities exhibited large differences in comparisons among major clades and between generalized models and ribosomal protein models. Taken as a whole, these results reveal that a small number of relative exchangeabilities are responsible for much of the structure of the “model space” for protein sequence evolution. Beyond the information that these clade-specific models reveal about protein evolution, the models themselves are likely to be useful tools for phylogenomic inference across the tree of life.
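As a rough illustration of what "distances among clade-specific models" could mean in practice, the sketch below treats each model as a vector of 190 relative exchangeabilities (one per unordered amino acid pair), rescales each vector to a mean of 1, and computes pairwise Euclidean distances. The abstract does not state which normalization or distance measure was used, so both are assumptions here, and the model names and values are hypothetical toy data.

```python
import math

N_PAIRS = 20 * 19 // 2  # 190 unordered amino acid pairs


def normalize(rates):
    """Rescale exchangeabilities to mean 1 (assumed normalization)."""
    mean = sum(rates) / len(rates)
    return [r / mean for r in rates]


def distance(a, b):
    """Euclidean distance between two normalized exchangeability vectors
    (an assumed metric, not necessarily the one used in the study)."""
    a, b = normalize(a), normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(models):
    """Return sorted model names and the matrix of pairwise distances."""
    names = sorted(models)
    matrix = [[distance(models[p], models[q]) for q in names] for p in names]
    return names, matrix


# Hypothetical toy models; real models would hold estimated exchangeabilities.
toy_models = {
    "clade_a": [1.0] * N_PAIRS,
    "clade_b": [2.0] * N_PAIRS,                # same shape as clade_a, different scale
    "clade_c": [1.0] * (N_PAIRS - 1) + [5.0],  # one exchangeability differs
}

names, matrix = distance_matrix(toy_models)
```

In this toy example, `clade_a` and `clade_b` differ only in overall scale, so their distance is zero after normalization, while `clade_c` is separated from both by a single divergent exchangeability. A distance matrix of this kind could then be passed to a clustering or tree-building method to examine the structure of the model space.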
Subject: Biology and Life Sciences - Biochemistry and Molecular Biology
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.