1. Introduction
Intrinsically disordered regions (IDRs) are functionally important regions of proteins that lack stable structural integrity, are abundant in eukaryotic and viral proteomes, and are widely present in archaea and bacteria [
1]. In addition to their deviant sequence behavior, IDRs display distinctive biophysical properties in terms of sequential, structural and spatiotemporal heterogeneity that qualify them as ‘edge of chaos’ systems [
2]. The flexibility that arises from these heterogeneous properties gives IDRs a functional advantage over their structured counterparts, enabling participation in complex biological functions such as recognition, regulation and signaling [
3]. However, the behavior of these ‘edge of chaos’ systems is sensitive to environmental perturbations and mutations that can lead to misidentification and mis-signaling. Such dysfunction of IDRs has been observed to play a role in amyloidosis, cancer, cardiovascular disorders, and neurodegenerative diseases [
3]. Therefore, evolutionary forces act on such ‘non-regular secondary structure regions’ to preserve biological function [
4,
5].
Remarkably, IDR dynamic behavior is evolutionarily conserved, despite low sequence conservation [
5]. Furthermore, flexibility is conserved in proteins [
6]. Thus, measuring intrinsic disorder can quantify the inherent flexibility of proteins. Protein loops, the major contributors to structural flexibility, are a source of functional heterogeneity and are therefore important to understanding the relationship between function and flexibility [
7,
8]. Furthermore, the functional activities of proteins have been proposed to be determined by the molecular functions of loops known as ‘elementary functional loops’ (EFLs) [
9]. The EFLs are enriched in amino acid residues responsible for a specific function, with abundant sets of prototypes, including the p-loop prototype responsible for a majority of enzymatic functions [
10]. EFLs have proven useful in studying the evolution of protein function in archaeal organisms, suggesting that loop classification systems are a promising route to understanding functional innovation by the reuse of such components in different molecular contexts [
11]. In fact, coupling EFLs with network science has provided evolutionary insights into the formation of complex protein structures through the recruitment of loops [
12]. Disorder in proteins has been extensively studied at the proteomic level [
13,
14,
15,
16] and to a limited extent at the protein domain level [
17]. However, disorder in loop structures, one of the most granular levels of the hierarchy of molecular structure, remains the least explored, even though loops are fundamental contributors to protein flexibility.
A previous exploration of contact order in proteins, which is correlated to structural flexibility, showed that there are important evolutionary constraints acting on folding speed [
18]. That work showed that folding speed increases in evolution. A separate study used network approaches to trace the birth of structural domains from loop structures [
19]. Here, we investigate the evolution of disorder at the protein loop level, one of the lowest levels of organization in biological molecular systems. We surveyed thousands of loop prototypes derived from ArchDB [
20] and traced their evolutionary history by mapping them to the history of the corresponding domains defined at the fold family (FF) level of structural abstraction of SCOP [
21]. This evolutionary history is based on reliable phylogenomic reconstruction methods that are relatively robust to high mutation rates, horizontal gene transfer, and genetic mosaicism when compared to traditional sequence methods [
22].
2. Materials and Methods
We performed intrinsic disorder analysis of loop structures associated with loop prototypes classified by the ArchDB database [
20]. Loop prototypes (
Figure 1) define the ArchDB classification based on a set of geometric properties, with the following naming scheme (
Figure 1a): clustering method used, ‘type’ of bracing secondary structures (
Table 1), length of the unstructured loop region between the bracing secondary structures, class and subclass [
20]. Two types of clustering methods have been used for classification in ArchDB: Density Search (DS) and Markov Clustering (MCL). The two methods treat loop length differently: DS is the more stringent because it groups together only loops of a fixed ‘length’, while MCL allows length to vary within a group. A class clusters loops with the same conformation of the loop region, while a subclass groups loops with a common geometry (
Figure 1b).
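As an illustration of the naming scheme, the following minimal sketch splits a prototype name into its five fields. It assumes the fields are dot-separated in the order listed above, as in the name ‘DS.EH.6.17.1’ that appears in the Results, and that length, class and subclass are integers. The `PrototypeName` class and `parse_prototype_name` function are hypothetical helpers, not part of ArchDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrototypeName:
    """Fields of a loop prototype name (hypothetical container)."""
    method: str      # clustering method, e.g. "DS" or "MCL"
    loop_type: str   # bracing secondary structures, e.g. "EH"
    length: int      # length of the loop region
    loop_class: int  # class: same conformation of the loop region
    subclass: int    # subclass: common geometry


def parse_prototype_name(name: str) -> PrototypeName:
    """Split a name such as 'DS.EH.6.17.1' into its five fields.

    Assumes dot-separated fields in the order
    method.type.length.class.subclass.
    """
    parts = name.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated fields, got {name!r}")
    method, loop_type, length, loop_class, subclass = parts
    return PrototypeName(method, loop_type, int(length),
                         int(loop_class), int(subclass))


if __name__ == "__main__":
    print(parse_prototype_name("DS.EH.6.17.1"))
```

Under these assumptions, ‘DS.EH.6.17.1’ reads as a Density Search prototype of type EH with a loop length of 6, class 17 and subclass 1.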
A loop structure is the region in a Protein Data Bank (PDB) structure annotated with a loop prototype. It is named by its parent PDB structure, chain, and the location of its first residue in the parent structure; e.g., the loop in
Figure 1b is part of chain A of PDB entry 4ETP, beginning at residue 448.
The loop structural dataset of ArchDB [
20] holds 190,573 classified loop structures out of a total of 306,726 reported loops. The dataset associated with Density Search (DS) loop prototypes, which holds 125,824 loops, was filtered using mappings of FFs to loop prototypes at e-value < 0.001. This resulted in 88,321 loop structures corresponding to 7,110 unique DS prototypes. Note that each loop structure in ArchDB has one loop prototype annotation associated with it for a particular classification system (DS, in our case). However, many-to-many annotations exist between loop prototypes and SCOP FFs. We mapped the SCOP FFs from ArchDB to those in our phylogenomic timeline and then retained only the loop prototypes mapped to a single SCOP FF. This resulted in 5,125 loop prototypes mapped to 1,965 FFs (
Table S1). We then transferred times of origin of SCOP FFs to the associated loop prototypes as previously described [
19]. These evolutionary ages were measured as node distances (
nd) extracted from a published phylogenomic tree reconstructed from a genomic census of FFs in 8,127 proteomes belonging to the three superkingdoms of life and viruses [
23], using thoroughly tested phylogenomic protocols [
24,
25]. Cellular organisms were represented by 139 archaeal, 1,734 bacterial, and 210 eukaryal proteomes. The virus supergroup was represented by 6,044 viral proteomes [
26].
Figure S1 describes the general experimental workflow that was utilized to build the published phylogenomic tree and the annotated loop chronologies of this study.
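The filtering and age-transfer steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the input layout (tuples of prototype, FF and e-value, plus a dictionary of FF ages) and the function name `single_ff_prototypes` are hypothetical, and the example values are toy data.

```python
from collections import defaultdict


def single_ff_prototypes(mappings, ff_age, max_evalue=0.001):
    """Keep prototypes mapped to exactly one FF and transfer its age.

    mappings   : iterable of (prototype, ff, evalue) tuples (assumed layout)
    ff_age     : dict mapping FF identifier -> nd value from the timeline
    max_evalue : e-value cutoff; the text uses e-value < 0.001
    Returns a dict prototype -> (ff, nd).
    """
    ffs_by_prototype = defaultdict(set)
    for prototype, ff, evalue in mappings:
        # Keep only significant hits to FFs present in the timeline.
        if evalue < max_evalue and ff in ff_age:
            ffs_by_prototype[prototype].add(ff)

    ages = {}
    for prototype, ffs in ffs_by_prototype.items():
        # Discard prototypes that remain mapped to more than one FF.
        if len(ffs) == 1:
            (ff,) = ffs
            ages[prototype] = (ff, ff_age[ff])
    return ages


if __name__ == "__main__":
    # Toy data: prototype labels, e-values and nd values are invented.
    toy_mappings = [
        ("P1", "c.37.1.12", 1e-5),
        ("P2", "c.37.1.12", 1e-4),
        ("P2", "a.1.1.1", 1e-6),   # P2 hits two FFs and is dropped
        ("P3", "a.1.1.1", 0.5),    # fails the e-value cutoff
    ]
    toy_ages = {"c.37.1.12": 0.0, "a.1.1.1": 0.2}
    print(single_ff_prototypes(toy_mappings, toy_ages))
```

Restricting the set to prototypes with a single FF makes the age transfer unambiguous, at the cost of discarding prototypes shared among several FFs.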
Structural disorder was computed using a local copy of the IUPRED software with the ‘short’ disorder option [
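27]. A residue was categorized as disordered if it scored above a threshold of 0.5. The disorder of a loop structure was calculated as the fraction of disordered residues among all of its residues. The mean disorder for each loop prototype was the average of the disorder scores of the individual loop structures associated with that prototype:

$$\bar{d}_p = \frac{1}{N_p} \sum_{i=1}^{N_p} d_i, \qquad d_i = \frac{n_i^{\mathrm{dis}}}{n_i},$$

where $N_p$ is the number of loop structures annotated with prototype $p$, $n_i$ is the number of residues in loop structure $i$, and $n_i^{\mathrm{dis}}$ is the number of those residues scoring above 0.5.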
A loop prototype was classified as ‘ordered’ if its mean disorder score was from 0 to 0.1, as ‘moderate disorder’ if it was from 0.1 to 0.3, and as ‘high disorder’ if it was greater than 0.3.
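The scoring scheme above can be sketched in a few lines. The sketch assumes that per-residue disorder scores have already been obtained (for example, from IUPred output parsed elsewhere). The function names are hypothetical, and the handling of scores falling exactly on 0.1 or 0.3 is an assumption, since the text does not say which category includes the boundary.

```python
from statistics import mean


def loop_disorder(residue_scores, threshold=0.5):
    """Fraction of residues in one loop structure scoring above the threshold."""
    if not residue_scores:
        raise ValueError("a loop structure needs at least one residue")
    disordered = sum(score > threshold for score in residue_scores)
    return disordered / len(residue_scores)


def prototype_mean_disorder(loop_structures):
    """Mean of the per-loop disorder fractions for one prototype.

    loop_structures: list of per-residue score lists, one per loop structure.
    """
    return mean(loop_disorder(scores) for scores in loop_structures)


def classify_prototype(mean_disorder):
    """Assign a disorder category; boundary handling is an assumption."""
    if mean_disorder <= 0.1:
        return "ordered"
    if mean_disorder <= 0.3:
        return "moderate disorder"
    return "high disorder"


if __name__ == "__main__":
    # Toy per-residue scores for two loop structures of one prototype.
    loops = [[0.1, 0.2, 0.6, 0.7], [0.9, 0.9]]
    m = prototype_mean_disorder(loops)
    print(m, classify_prototype(m))
```

Note that the prototype score averages per-loop fractions, so short and long loop structures contribute equally regardless of their residue counts.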
3. Results
We conducted disorder analysis on 5,125 loop prototypes associated with 1,965 FFs. FFs were annotated with times of origin (evolutionary ages given as
nd values) derived from a genomic census of 8,127 proteomes from the three superkingdoms and viruses. The evolutionary ages are based on phylogenomic methods benchmarked by well over a decade of research and experimentation [
28,
29,
30,
31,
32,
33,
34]. We inspected disorder and various structural and geometric properties of loop prototypes in superkingdoms and viruses, indexed their associated molecular functions, and explored the evolutionary spread of prototypes in a phylogenomic timeline.
Mean disorder scores appeared to decline sharply with increasing mean loop structure length (
Figure 2a). However, while the median of mean disorder scores gradually increased with time of origin, the median of mean length of the loop structure was steady throughout the timeline (
Figure 2b). As expected, loops with high disorder outnumbered those with moderate and low disorder as their accumulation rates increased and decreased in the timeline (
Figure 2c). Interestingly, the medians of mean disorder score of loop prototypes in SCOP classes showed a general increase with age (
Figure 2b). Following a rejection of the null hypothesis of all medians being the same by the Kruskal-Wallis H test [
35], the Conover’s test of multiple comparisons [
36] indicated that the pairwise comparison of the four major classes of domains in SCOP, namely, all-α, all-β, α+β, and α/β, showed a significant difference in medians (
Table 2). Mean disorder increased according to the sequence: all-α < α+β < α/β < all-β (
Figure 2d). Moreover, out of the 48 ‘ordered’ loop prototypes, 18 belonged to FFs from the α/β class, followed by 11 from all-α, 7 from α+β, 5 from all-β, and 7 belonging to the rest of the classes (
Table S2).
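For readers unfamiliar with the omnibus test used above, the following sketch computes the Kruskal-Wallis H statistic with the standard tie correction, using only the Python standard library. It is an illustration of the statistic, not the code used in the study. The function name is hypothetical, the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom for k groups) is not computed, and the Conover post hoc comparisons are not implemented here.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: list of lists of numeric observations, one list per group.
    """
    if any(len(g) == 0 for g in groups):
        raise ValueError("every group needs at least one observation")
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # size of this tie block
        rank_of[pooled[i]] = (i + 1 + j) / 2    # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    if tie_term == n ** 3 - n:
        raise ValueError("all observations are identical; H is undefined")

    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction


if __name__ == "__main__":
    # Toy data: three fully separated groups.
    print(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

A large H relative to the chi-square reference distribution indicates that at least one group median differs, which is what motivates the pairwise post hoc comparisons.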
A four-set Venn diagram of loop prototypes in superkingdoms and viruses showed a high number of loop prototypes associated with the ABEV and ABE Venn groups (
Figure 3a). The α-β type ‘HE’ claimed the highest percentages of loop types present in each superkingdom and the viral supergroup (
Figure 3b). The distribution of loop types appeared to follow a similar trend for all superkingdoms and viruses.
A closer inspection of the distribution of loop prototypes belonging to FFs of each Venn group (
Figure 4) along the evolutionary timeline revealed patterns of first origin matching those observed for FFs. As a general trend, ‘ordered’ prototypes (
Table S2) appeared later than high and moderate disorder prototypes, with 39 ‘ordered’ prototypes appearing around and after
nd = 0.4. Highly and moderately disordered prototypes appeared concurrently (at roughly the same time) for the ABEV, ABE, EV, and V groups. However, moderately disordered prototypes appeared earlier than highly disordered ones in the FFs of the BE group. Interestingly, the A Venn group had only high disorder loop prototypes. Evolutionary tracings also showed that while high disorder prototypes were present in all Venn groups (except AV), 24 of the 48 ‘ordered’ prototypes were only present in the FFs of the ABEV group, followed by 13 present in ABE, 4 in BEV, 2 in AB, 2 in BE, 2 in E and 1 in ABV (
Table S2).
The distribution of loop structural ‘types’ along the evolutionary timeline showed that all types appeared very early in evolution. The ‘DS.EH.6.17.1’ and ‘DS.EH.7.6.1’ prototypes appeared the earliest together with the most ancient ‘ABC transporter ATPase domain-like’ FF (c.37.1.12) (
Figure 5). Highly and moderately disordered prototypes of the α-α (HH) and β-α (EH) types appeared at approximately the same time. Except for the helix 3₁₀-helix 3₁₀ (GG) prototypes, all other ‘types’ had both ordered and moderately disordered prototypes. For the remaining seven types, highly disordered prototypes appeared earlier than both moderately disordered and ordered prototypes. There were 13 ‘ordered’ prototypes associated with the HH type, followed by 9 with EH, 7 with BK, 6 with HE, 5 with EG, 4 with BN, 2 with GE, and 1 each with HG and GH.
The median values for mean disorder scores by structural type were the highest for the helix 3₁₀-containing GG, EG, and GE prototypes, with a left skew in their respective distributions (
Figure 6a). To assess whether higher disorder scores for specific types were associated with the molecular function of the FFs they belong to, we inspected the distribution of structural types for each molecular function (
Figure 6b). Some of the functional categories showed a preference for certain types of prototypes. The β-β hairpin (BN) type comprised the highest number of prototypes present in FFs belonging to the ‘Intracellular processes’, ‘Extracellular processes’, and ‘Other’ categories. The α-α (HH) type dominated the distribution in FFs in both ‘Information’ and ‘Regulation’. The FFs associated with the ‘General’ and ‘Metabolism’ functional categories were associated with a high number of EH and HE types, respectively.
The survey of loop prototypes by disorder categories in molecular function showed that ‘Information’ and ‘Other’ were the only categories with no associated ‘ordered’ prototypes (
Figure 7). Out of the 48 ‘ordered’ prototypes, 29 belonged to ‘Metabolism’ FFs, followed by 7 to ‘Intracellular processes’ FFs, 6 to ‘General’ FFs, 5 to ‘Regulation’ FFs and 1 to ‘Extracellular processes’ FFs (
Table 3). A Gene Ontology (GO) enrichment analysis of the FFs with ordered prototypes showed that these FFs are highly enriched in activities mainly related to metabolism, transport, and DNA transcription as well as pathogenesis and immune response (
Table 3). Highly and moderately disordered prototypes appeared approximately at the same time in the FFs belonging to ‘Metabolism’ and ‘General’. For FFs with ‘Regulation’, ‘Intracellular processes’, and ‘Extracellular processes’ molecular functions, highly disordered prototypes appeared earlier than moderately disordered and ordered prototypes.
Loop prototypes with shorter loop regions, with lengths ranging from 1 to 7, were widespread throughout evolutionary time, while longer prototypes were part of FFs that appeared relatively late in evolution (
Figure 8a). The average lengths of the N-terminus and C-terminus of prototypes showed a consistent distribution with little variation throughout the timeline. Similarly, geometric properties of prototypes, namely, hoist (delta) and packing (theta) angles, and distance were spread consistently throughout evolutionary time. However, the median values for meridian (rho) angles showed an increase with time, while the Euclidean distance (D) between the boundaries of aperiodic structures showed a slight decrease (
Figure 8b).
5. Conclusions
Our study does not exclusively address intrinsic disorder. Instead, it uses disorder as a proxy for surveying protein flexibility, an approach taken by other recent studies [
43]. Results provide a deeper evolutionary view of the link between structure, disorder, flexibility and function. First, ancient loop prototypes tended to be more disordered than their derived counterparts, with ordered prototypes developing later in evolution. This highlights the central evolutionary role of disorder and flexibility. Second, there was an unexpected emergence of ordered prototypes early in evolutionary history, possibly driven by the need to preserve specific molecular functions. Third, the study uncovered percolation of evolutionary constraints from higher to lower levels of biological organization. This percolation influenced the spread of disorder in prototypes. Fourth, the analysis revealed trade-offs between flexibility and rigidity in loop behavior, with different functional categories preferring specific structural types. Finally, tracing the evolution of structural and geometric properties of loops revealed variations in loop length and geometry along the evolutionary chronology of prototypes. These findings provide valuable insights into the role of protein loops in evolution and their contribution to protein structure and function.
We conclude by acknowledging some limitations of our study. First, the accuracy of the disorder analysis relies on the precision of the available software, which introduces the possibility of false negatives and false positives in our analyses. Second, biases within databases, such as the presence of disordered structures in the PDB and, consequently, in ArchDB, may also act as limiting factors. Lastly, there are many-to-many mappings between loop prototypes and FFs with varying e-values in ArchDB. In our study, we opted for a stringent e-value threshold of < 0.001 and retained only prototypes mapped to a single FF at this threshold. While this choice may lead to missing some hits, it helps mitigate issues that could arise from a high number of false positives. In the future, these limitations can be addressed by expanding database knowledge and enhancing the accuracy of prediction software.