1. Introduction
The Coronavirus Disease 2019 (COVID-19) pandemic has led to a total of 775,645,882 reported cases and 7,051,876 deaths worldwide [1]. To put these numbers into perspective, the death toll alone is more than the combined populations of several US states such as Montana, Rhode Island, Delaware, South Dakota, North Dakota, Alaska, the District of Columbia, Vermont, and Wyoming [2]. The total number of cases exceeds the population of the entirety of Europe by ~28 million people [3]. This makes COVID-19 one of the deadliest pandemics to hit the human population in the 21st century, surpassing the impact of the Middle East Respiratory Syndrome (MERS) (September 2012, Jeddah, Saudi Arabia) and Severe Acute Respiratory Syndrome (SARS) (March 2003, Guangdong, China) coronaviruses [4,5], which had total death tolls of 949 (MERS) and 774 (SARS) worldwide [6,7]. This is remarkable because COVID-19 has a much lower mortality rate (5.19%) than SARS (13%) or MERS (up to 35%) [8]. The disproportionate death tolls amongst these pandemics caused by coronaviruses can be attributed to the asymptomatic phase of COVID-19, which SARS and MERS lack [9]. Asymptomatic patients can carry on with their daily routines without knowing that they carry viral loads similar to those of their symptomatic counterparts, enabling high transmission rates within a population [10]. This challenges disease control and mitigation as asymptomatic individuals spread disease to the uninfected. The most extreme case of asymptomatic COVID-19 occurrence was the “Diamond Princess” cruise ship population in Japan, which experienced 641 COVID-19 infections out of 3,711 passengers, with over half (328) of those infected being asymptomatic [11]. This makes the personal responsibility of an individual the ‘number one’ factor in preventing the spread of COVID-19 through masking and other precautionary measures [12] and makes it even harder for policymakers to ensure that proper preventive measures are in place.
The COVID-19 disease is caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which has a 79.6% sequence similarity to SARS-CoV-1, the causative agent of the first SARS outbreak [13]. Remarkably, 96% of its full-length ~30 kb genome sequence is identical to the sequence of a bat coronavirus [13]. The SARS-CoV-2 genome encodes 29 proteins, including 4 structural proteins crucial to viral function: the Spike (S), Nucleocapsid (N), Membrane (M), and Envelope (E) proteins. The evolution of the S-protein and M-protein and its effects on their structure have been previously reported [14,15]. In contrast, no significant changes were detected in the structure of the E-protein. The N-protein packages the positive-sense RNA genome of the virus, forming the ribonucleoprotein structures of the viral capsid. The protein is necessary for viral assembly and RNA synthesis and participates in several cellular processes affecting immunological and cell cycle responses of the host [16]. The first mutations targeting the N-protein were found to be genetically linked via a haplotype (haplotype H2) [17]. The worldwide appearance of H2 very early in the pandemic followed the spread of the first reported haplotype (H5), which harbored the S-protein ‘D614G’ mutation. The N-protein comprises five distinct regions, two of which are structured: the N-Terminal Domain (NTD) (residues 44-174) and the C-Terminal Domain (CTD) (residues 255-364). These domains are connected by a disordered linker region (LKR) (residues 175-254) and are flanked by Intrinsically Disordered Regions (IDRs), the N-arm (residues 1-43) and the C-tail (residues 365-419) [18]. The CTD allows the N-protein to self-associate to form dimers, while the C-tail mediates higher-order assembly into tetramers [19]. The disordered nature of the IDRs plays a crucial role in binding to viral RNA with high cooperativity by enhancing binding affinity and allostery [20]. This increased binding affinity to nucleic acids is the result of the flexibility of disordered regions, which allows multiple nucleic acid binding sites to associate with the same nucleic acid in an optimized allosteric conformation [21]. Both the NTD and CTD bind to viral RNA to form the nucleoprotein core of the virus [18,22]. The N-protein also interacts with the M-protein to fix ribonucleoproteins to the viral membrane [23]. Aside from its role in viral genome packaging, the N-protein also regulates antiviral immunity by the induction of interferon responses [24]. Following the formation of the ribonucleoprotein core at the ER-Golgi intermediate compartment (ERGIC), where the rest of the structural proteins are assembled, the budding process begins with proteins and viral RNA entering the lumen of secretory vesicles, followed by transport of assembled virions out of the host cell via exocytosis [25,26,27]. There is considerable cooperativity between the four structural proteins of SARS-CoV-2. This enables efficient viral development and release, which is also demonstrated by the mutational evolutionary landscape. Mutations in the S, M, and N proteins have dominated those observed throughout the pandemic [14]. These proteins are therefore important targets for vaccine and drug development.
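The five-region architecture of the N-protein described above can be sketched as a small lookup table. The residue ranges are taken from the text; the names `Region`, `N_PROTEIN_REGIONS`, `region_of`, and `disordered_fraction` are hypothetical helpers introduced only for illustration.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    name: str
    start: int       # first residue, 1-based and inclusive
    end: int         # last residue, inclusive
    structured: bool  # True for the two folded domains


# Residue ranges as stated in the text for the 419-residue N-protein.
N_PROTEIN_REGIONS: List[Region] = [
    Region("N-arm", 1, 43, False),
    Region("NTD", 44, 174, True),
    Region("LKR", 175, 254, False),
    Region("CTD", 255, 364, True),
    Region("C-tail", 365, 419, False),
]


def region_of(residue: int) -> Region:
    """Return the region that contains a 1-based residue position."""
    for region in N_PROTEIN_REGIONS:
        if region.start <= residue <= region.end:
            return region
    raise ValueError(f"residue {residue} is outside the N-protein (1-419)")


def disordered_fraction() -> float:
    """Fraction of residues that fall in the three disordered regions."""
    total = N_PROTEIN_REGIONS[-1].end
    disordered = sum(
        r.end - r.start + 1 for r in N_PROTEIN_REGIONS if not r.structured
    )
    return disordered / total


if __name__ == "__main__":
    for position in (63, 203, 215, 235):
        print(position, region_of(position).name)
    print(f"disordered fraction: {disordered_fraction():.3f}")
```

Such a table makes it straightforward to assign any mutated position to its region before comparing structural effects across haplotypes.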
Variants of Concern (VOCs) have been replacing each other since the beginning of the COVID-19 pandemic [28]. Their mutant constellations hold ‘mutations of concern’ that are of immediate priority for surveillance and response. The effect of these mutations on the protein sequence must be linked to effects at the 3-dimensional (3D) atomic structure level to dissect the functional significance of individual VOCs and associated haplotypes. Three main strategies model protein structure: homology modeling, fold recognition, and ab initio methodologies [29]. Homology modeling and fold recognition rely on existing sequence and folded structure data and are comparative in nature. These methods can be limited in their ability to accurately predict the true 3D atomic structure of novel proteins, especially in molecular systems subjected to fast mutation rates. Ab initio methods, however, do not use pre-existing knowledge. Instead, they build models directly from amino acid sequences and the stoichiometric constraints of those sequences. Such an approach is especially useful for modeling proteins with low homology. AlphaFold2 [30] is the star of the last two biennial ab initio structure prediction experiments (CASP14 and CASP15) [31,32]. Its deep learning algorithm makes fast atomic structural predictions with levels of accuracy that are within the margin of error of experimental structure determination methods. Crucially, this reduces reliance on time-consuming traditional crystallographic and cryo-EM methods. In the absence of experimental protein structures from VOCs, numerous studies have used AlphaFold2 to explore the differences between the wild-type (WT) Wuhan strain, which is used as reference, and the emerging variants [15,33,34,35]. However, variant definitions do not reflect the complete viral landscape of SARS-CoV-2. Other groups of mutations can occur in greater frequencies than VOC constellations, and VOCs often embody latitude-delimited haplotypes that have their own unique accumulation profiles across climatic zones [14,36]. These haplotypes, which were identified following a study of over 12 million viral proteomes, uncover seasonal patterns of emergence and help link structural conformations to environmental factors, revealing an interplay between viral evolution, our environment, and our immune systems [14]. In fact, COVID-19 epidemiological variables exhibit significant negative correlation with temperature and positive correlation with latitude, often peaking around winter months, further indicating the seasonal nature of the virus [36,37,38].
Here we use AlphaFold2 to model the 3D structures of mutant N-protein molecules defining SARS-CoV-2 haplotypes and constellations (Table S1). We also study the effect of mutations on the regions of intrinsic disorder of the molecule. Studying the structural changes observed across the pandemic uncovered patterns of structural recruitment across haplotypes and VOCs, indicative of a complex interplay between the virus and its environment that mediates viral evolution. We show that the N-protein was impacted by many haplotypes, beginning with H2 and its effects on the LKR region and ending with the rise of the VOC Omicron constellation and its cooperative effects on protein structure. Our study highlights the importance of ab initio techniques and the utility of AlphaFold2 for comparative structural studies of proteins that lack experimentally determined atomic structures and contain regions of intrinsic disorder known to be difficult to model by traditional means.
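Modeling a haplotype begins with a mutant sequence derived from the reference. The following is a minimal sketch of that preparatory step, not the authors' pipeline: it applies substitutions written in the usual notation (for example ‘D614G’) to a reference string and formats the result as FASTA. The function names and the ten-residue toy sequence are hypothetical and do not correspond to the real N-protein; deletions are deliberately not handled.

```python
import re
from typing import Sequence

# Matches substitutions such as "K2R": wild-type residue, 1-based position, mutant residue.
_SUBSTITUTION = re.compile(r"([A-Z])(\d+)([A-Z])")


def apply_substitutions(reference: str, mutations: Sequence[str]) -> str:
    """Return a copy of `reference` with each point substitution applied."""
    residues = list(reference)
    for mutation in mutations:
        match = _SUBSTITUTION.fullmatch(mutation)
        if match is None:
            raise ValueError(f"unsupported mutation notation: {mutation!r}")
        wild_type, mutant = match.group(1), match.group(3)
        index = int(match.group(2)) - 1  # convert to 0-based index
        if not 0 <= index < len(residues):
            raise ValueError(f"{mutation}: position outside the sequence")
        if reference[index] != wild_type:
            raise ValueError(
                f"{mutation}: reference has {reference[index]!r} at that position"
            )
        residues[index] = mutant
    return "".join(residues)


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_reference = "MKTAYIAKQR"  # toy sequence, NOT the N-protein
    mutant = apply_substitutions(toy_reference, ["K2R", "A4G"])
    print(to_fasta("toy_mutant", mutant), end="")
```

Checking the wild-type residue before substituting guards against off-by-one errors when mutation lists and reference sequences come from different sources.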
2. Materials and Methods
Accelerated ab initio modeling of 3D atomic structures of the N-protein was conducted using the AlphaFold2 pipeline [30] implemented locally in ColabFold without changes or modifications [39]. Five ranked structural models were obtained following twelve neural network recycles (processing of predictions through models) that iteratively extracted co-evolutionary information in PDB70 structural templates and multiple sequence alignments (MSAs) for end-to-end training of the deep learning ‘evoformer’ and ‘structure’ multi-layered neural network modules. MSAs were built with fast and sensitive MMseqs2-based homology searches of UniRef100 and a database of environmental sequences. Accuracy was measured with the predicted local distance difference test (pLDDT) and the predicted aligned error (PAE). pLDDT provides a per-residue estimate of prediction confidence based on the LDDT-Cα metric [40]. The expected prediction reliability of a given region or molecule follows pLDDT ‘confidence bands’: >90, models with very high confidence; 70-90, confident models, showing good backbone predictions; 50-70, models with low confidence; and <50, models with very low confidence, generally showing ribbon-like structures. pLDDT <60 can be considered a reasonably strong predictor of intrinsic disorder. PAE measures confidence in the relative positions of pairs of residues and thus evaluates the cohesiveness of structural modules (e.g., domains).
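The confidence bands and the disorder threshold described above can be expressed as a short sketch. The function names are hypothetical, and the treatment of the exact boundary values (90, 70, and 50) is an assumption, since the text does not say which band they belong to; the per-residue scores in the usage example are invented for illustration.

```python
from typing import List, Sequence, Tuple


def confidence_band(plddt: float) -> str:
    """Map a pLDDT score (0-100) to the confidence band described in the text."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError(f"pLDDT must lie in [0, 100], got {plddt}")
    if plddt > 90:
        return "very high"
    if plddt >= 70:  # assumption: boundary values fall in the higher band
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


def disordered_segments(
    plddt_per_residue: Sequence[float], threshold: float = 60.0
) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs of residues with pLDDT < threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for position, score in enumerate(plddt_per_residue, start=1):
        if score < threshold:
            if start is None:
                start = position  # open a new low-confidence run
        elif start is not None:
            segments.append((start, position - 1))  # close the current run
            start = None
    if start is not None:
        segments.append((start, len(plddt_per_residue)))
    return segments


if __name__ == "__main__":
    toy_scores = [35, 42, 58, 75, 92, 88, 55, 40]  # invented values
    print([confidence_band(s) for s in toy_scores])
    print(disordered_segments(toy_scores))
```

Reporting contiguous runs rather than individual residues makes it easier to compare predicted disorder against the region boundaries of the protein.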
Structural alignments and visualizations were carried out using Chimera [41]. Reference (corresponding to EPI_ISL_402124) and variant structures were superimposed using the MatchMaker and MatchAlign tools to identify regions with structural divergences. Topological similarities of individual regions or entire molecules were evaluated with average template modeling scores (TM-scores) using USalign [42,43]. N-protein predictions were benchmarked against cryo-EM models of the two structural domains [18]. In addition to TM-scores, Global Distance Test – Total Score (GDT-TS) values were obtained using the LGA (local-global alignment) structure comparative analysis tool with the AS2TS server [44,45,46], which CASP assessors routinely use to evaluate the accuracy of predicted structural models. The data presented in this study for the N-protein are openly available in ModelArchive under accession ma-gca-nprot.
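To make the two similarity measures concrete, the sketch below computes a TM-score and a GDT-TS value from a list of Cα-Cα distances between already aligned and superimposed residue pairs. It is an illustration of the standard formulas only: USalign and LGA additionally search for the superposition that maximizes each score, which this sketch does not attempt. The function names and the 0.5 Å floor on the distance scale are assumptions.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 used to normalize the TM-score."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if target_length <= 15:
        return 0.5  # assumption: floor for very short chains
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superposition; `distances` are in angstroms."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def gdt_ts(distances: Sequence[float], target_length: int) -> float:
    """GDT-TS (0-100): mean percentage of residues within 1, 2, 4, and 8 angstroms."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    within = sum(sum(1 for d in distances if d <= c) for c in cutoffs)
    return 100.0 * within / (len(cutoffs) * target_length)


if __name__ == "__main__":
    toy_distances = [0.5, 1.5, 3.0, 6.0, 10.0]  # invented values
    print(f"TM-score: {tm_score(toy_distances, 5):.3f}")
    print(f"GDT-TS:   {gdt_ts(toy_distances, 5):.1f}")
```

Both scores are normalized by the length of the target rather than by the number of aligned pairs, so unaligned residues lower the score instead of being ignored.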
4. Discussion
The N-protein is responsible for binding to the genetic material of the virus (RNA) and packaging it into ribonucleoprotein (RNP) particles [59]. This is achieved through RNA binding at various locations of the N-protein, spanning the N-arm [60], NTD [61], LKR [60], and CTD [20,62,63] regions. The CTD is also responsible for tetramer formation due to its ability to self-associate with the aid of the C-tail region [19]. The LKR plays an important role in RNP packaging [64] and oligomerization of N-protein molecules [65]. Both of these functions are achieved through biochemical and physical changes brought about by the phosphorylation of the region, which introduces the significant electrostatic changes required to mediate these processes [59]. We observed that the most significant structural changes induced by mutant constellations and haplotypes occurred in the LKR region across numerous phases of the COVID-19 pandemic. In contrast, the NTD only saw significant change (including morphological changes apart from shifts in 3D space) in H18 and VOC Omicron due to the triple deletion. Note that both regions R1 and R2 of the N-arm are part of a single epitope spanning residues 20-59, and R3 is part of an epitope spanning residues 211-235 [49]. R1 mostly sees significant structural changes in H18 and Omicron, as both of those structures harbor the triple deletion, which all other structures lack (Figure S1). This suggests that the structural and mutational changes we observed could be due to the immunological pressures acting throughout the pandemic, forcing the virus to evolve to better ensure its survival. R3 is especially important because its structural changes coincide with an epitope that also saw two mutations, G215C and S235F, and with notable decreases in protein disorder (Figure 5). Our findings suggest that structural changes are observed in regions of functional significance that help immune evasion and other viral functions, including oligomerization for effective genome packaging.
The N-protein is already a very disordered protein and hence changes are more likely to occur in regions of disorder. Consistent with this, the structured NTD and CTD regions remain relatively unchanged in sequence and structure except for one mutation (D63G) in the NTD (Figure 1). Interestingly, the most common trend observed was that both protein disorder and binding capability decreased for most regions of the N-protein throughout the pandemic, with the only notable increase being observed in the LKR for H1, H7 and VOC Alpha. The increase of binding capability was inherited from H1 to Alpha. However, the same was not true for VOC Delta and H7. Haplotypes H7 and H8 seemed to cancel out the impact of the R203M mutation on binding affinity, and both R203M and G215C removed the increase in binding capability at the LKR (Figure 5). This illustrates the complex interactions between different mutations of several haplotypes within their VOC constellations, as they can behave synergistically or antagonistically to one another. Despite these changes occurring in the regions of disorder, all of the changes we reported showed a decrease in protein disorder except for one very small portion at the triple deletion site in H18 and VOC Omicron.
R2 and R3, which embody established epitopes [49], were the only two regions that saw structural changes of immunological significance. Both of these regions showed decreases in protein disorder accompanied by decreases in binding capability (Figure 5). This indicates a mechanism of immune evasion where mutations such as the triple deletion in H18 and VOC Omicron, along with the mutations G215C and S235F, acted as a means to bind less effectively with antibodies. These structural changes, along with those in R1 and R4, appear to be recruited throughout the pandemic in various combinations amongst haplotypes and VOC Omicron. This combinatorial strategy highlights the evolutionary landscape of structural exploration that is unfolding in the N-protein, where combinations of smaller structural changes are adopted that benefit viral function. These changes are not limited to phases of the pandemic either. Earlier structures, such as those observed in region R4 of haplotypes H1 and H2, were later found in VOC Omicron. In Figure S1 we observe that H1, H2, and Omicron have an alpha helix causing significant differences in atomic distances between corresponding atoms. R3 revealed a similar recruitment pattern spanning the entire length of the pandemic, with the same sub-structure being found in H2, H7, and VOC Omicron. H2, H7, and Omicron have an extended alpha helix that is part of the bigger alpha helix downstream of the structure and is slightly shifted from the original axis of the Wuhan structure (Figure S1). H1 and Delta also adopted extended alpha helices in R3, but these were not shifted from the original axis. The exploration of various sub-structures at specific regions of a protein can be thought of as a metric of structural entropy, where higher numbers of structural conformations indicate higher levels of structural entropy. Most of the changes in the N-protein did not show much variation in the sub-structures but only in the combination of these sub-structures with one another. Since the same structure was recruited at regions R3 and R4 along the length of the pandemic, we consider these to be examples of entropic fixations. Here, structural entropy does not expand as numerous conformations are not being explored, indicating the N-protein adopted a more targeted evolutionary approach when compared to the S-protein [15]. These structural recruitments reveal patterns of synergy and antagonism during the pandemic. VOCs Alpha and Delta behaved antagonistically to the structural effects of their constituent haplotypes, undoing their structural impacts. VOC Omicron, in turn, amplified or retained the individual effects of its constituent haplotypes. These effects mimic those found in our study of the S-protein [15].
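The notion of structural entropy used above is qualitative. One possible way to quantify it, offered here purely as an assumption and not as a metric used in this study, is the Shannon entropy of the sub-structure states observed at a region. The sketch below applies it to the R3 states named in the text; the state labels and the function name are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def structural_entropy(states: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the sub-structure states seen at one region."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one state is required")
    if len(counts) == 1:
        return 0.0  # a single recurring sub-structure carries no entropy
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# R3 sub-structures as described in the text; labels are illustrative shorthand.
R3_STATES = {
    "Wuhan": "reference helix",
    "H1": "extended helix",
    "Delta": "extended helix",
    "H2": "extended shifted helix",
    "H7": "extended shifted helix",
    "Omicron": "extended shifted helix",
}

if __name__ == "__main__":
    entropy = structural_entropy(R3_STATES.values())
    upper_bound = math.log2(len(R3_STATES))
    print(f"R3 entropy: {entropy:.3f} bits (maximum {upper_bound:.3f} if all differed)")
```

Under this reading, repeated recruitment of the same sub-structure keeps the entropy well below the maximum that would be reached if every haplotype explored a distinct conformation.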