3.3. The Interactome-1060
Figure 1 shows an interactome comprising proteins from the human proteome selected through a sequential selective process that identified those with the highest experimental probability of being involved in the metabolic processes induced by the S1 protein in the human organism. The interactome comprises 1,060 nodes and 17,494 edges, got by selecting the highest significance interactions (confidence score 0.900) and excluding the Text mining channel. This is another important point because of the uncertainty arising from detections of protein interactions, which is reflected in the network's structure [48]. PIN (Protein Interaction Network) analysis should be reproducible, by similar results across different scoring thresholds of calculation systems. This suggests that, for maximum confidence, we need to have a robust metric across the network to have meaningful and reproducible topological results [48]. The topology of the interactome-1060 is complex because of the many peripheral sub-graphs enveloping an extended central core. The various peripheral sub-graphs are connected through a few interactions of specific nodes at the interface with the central body and with each other. We can observe subgraphs (also called communities or molecular modules) densely connected within themselves but poorly connected with the rest of the network. The intensity of the connections and the compactness of each sub-graph suggest they represent molecular complexes that carry out specific and common functional activities [61,62,63]. This broad functional connectivity shows the possibility of an extensive repertoire of responses to stimuli. After all, the cell is a complex multi-agent system programmed to perform predefined functions at specific times. Therefore, the interactome-1060 represents a robust set of human proteins suitable for a reverse engineering approach. With it, we will assess the significance of each single interaction by evaluating its real biological meaning. We believe that this approach has a broader value than the rather reductionist meaning of reverse engineering as a technology [64]. In short, we try to discover the one-to-one interactions of S1 in the network by validating them through external biological information.
Figure 1.
Human interactome generated by seed proteins physically interacting with S1 subunit of SARS-CoV-2. Topological parameters are in Table 2. STRING calculated the interactome.
Figure 1.
Human interactome generated by seed proteins physically interacting with S1 subunit of SARS-CoV-2. Topological parameters are in Table 2. STRING calculated the interactome.
2.3.1. Quantitative Aspects of Interactome-1060 Functional Processes
Table 1 shows an overview of the functional processes activated by interactome-1060.
Table 1.
Functional Processes activated in the human genome by the interactome-1060.
Table 1.
Functional Processes activated in the human genome by the interactome-1060.
Biological Process |
Terms significantly enriched |
|
Biological Process (Gene Ontology) |
1430 terms |
Molecular Function (Gene Ontology) |
165 terms |
Cellular Component (Gene Ontology) |
283 terms |
Reference publications (PubMed) |
>10,000 publications |
Local network cluster (STRING) |
251 clusters |
KEGG Pathways |
202 pathways |
Reactome Pathways |
693 pathways |
WikiPathways |
302 pathways |
Disease-gene associations (DISEASES) |
114 diseases |
Tissue expression (TISSUES) |
167 tissues |
Subcellular localization (COMPARTMENTS) |
287 compartments |
Human Phenotype (Monarch) |
787 phenotypes |
Annotated Keywords (UniProt) |
103 keywords |
Protein Domains (Pfam) |
17 domains |
Protein Domains and Features (InterPro) |
144 domains |
Protein Domains (SMART) |
44 domains |
All enriched terms (without PubMed) |
4989 enriched terms |
The control of so many biological processes by S1 is remarkable. This analysis is based only on experimental data from BioGRID, functionally analyzed by STRING. The calculated interactome is based on select only the most statistically significant interactions (through the highest confidence score) deriving from data and information from over 10,000 scientific articles and from eliminating information from the text mining channel which introduces biases in the data and in information compared to everyone else. Such an approach favors the greatest possible certainty of the interactions in the interactome (Figure 1). However, the more complex a network is with many multi-node interactions, the more intrinsically robust it is with reduced false positive interactions. The Excel file 1, sheet 1 reports the node degrees of the entire interactome-1060. The node with the highest connectivity is RPS27A (nodal degree: 230), a ribosomal protein. Hubs connect multiple nodes to centralize network traffic through a single connection point. Barabasi suggests that the range of degrees for including the HUB nodes should be half the value of this node (65-67). This range includes 65 HUB nodes out of 1060 nodes (6%) from 230 to 115 degrees. A closer inspection shows that these nodes are almost all ribosomal proteins, even if in different roles.
Among these high-ranking nodes, four of them regulate and control many ribosomal activities, showing disproportionately more interactions than other proteins. They are RPS27A, its paralogue UBA52, FAU and RACK1. RPS27A and UBA52 play crucial roles in targeting cellular proteins for degradation by the 26S proteasome, maintaining the chromatin structure, regulating gene expression, and the stress response [68,69]. FAU is a protein contributing to the assembly and functionality of 40S ribosomal subunits in the cytoplasm [70]. It is implicated in ribosomal biogenesis and is associated with various protein complexes, contributing to regulating the cell cycle. While RACK1 is a protein that controls translation directly and acts as a scaffold for signaling to and from the ribosome [71]. Upon viral infection, RACK1 remodels ribosomes so that they become optimal for translating viral mRNAs but not for host mRNAs [72]. Thus, they interface with multiple cellular functions and processes. Here, we focused on their pivotal roles in the synthesis of new proteins. To gain more insight into their activity, we used the STRING action “recenter” that rewires the network around these proteins, showing all the proteins in STRING that interact with them. This specific interactome (figure 5S) reveals a strong connection between the four proteins and their control over the remaining 793 proteins. The functional picture that emerges is that of four essential cytosolic small ribosomal subunits involved in viral mRNA translation (GO:0002181 Cytoplasmic translation, p-value: 4.16x10-85; GO:0042274 Ribosomal small subunit biogenesis, p-value: 3.79x10-45; GO:0006412 Translation, p-value: 2.67x10-194; GO:0006364 rRNA processing, p-value: 2.85x10-88; GO:0042254 Ribosome biogenesis, p-value:1.38x10-109; GO:0022613 Ribonucleoprotein complex biogenesis, p-value: 9.71x10-126; GO:0034660 ncRNA metabolic process, p-value:1.86x10-65; CL:143 Viral mRNA Translation, and Sec61 translocon complex, p-value:1.26x10-91; HSA-192823 Viral mRNA Translation, p-value: 2.39x10-72). All this shows a dynamic ribosome action in mediating crucial cellular mechanisms, even in pathologic states. It is a view quite like that of some authors who contest the traditional view of ribosomes as static and invariable entities [73,74]. To support this consideration, studies have showed that certain ribosomal proteins impede viral action in cultured human cells, leading to changes in human functionalities [75,76,77]
2.3.2. Significant Topological Parameters of the Interactome-1060
Regardless of the deep molecular machinery underlying the functional characteristics, the space that emerges from the analysis of these topological configurations provides a logical substrate for understanding viral strategies. The main topological characteristics of this interactome (see table 2) reveal important principles of cellular organization and functionality. An extensive eccentricity of the network, as shown by the high values of its diameter and radius (10 and 5), suggests functional peripheral subgraphs (or communities). The heterogeneity (1.187) supports a large tendency to have hub nodes [78], while a centralization value close to zero (0.189) supports compact and dense connections within the network. Another interesting parameter is the value of the average clustering coefficient (0 ≤ C ≤ 1) which reflects a modular organization [79] that, at light of the large diameter, also suggests an asymmetric architecture, as we observe it.
Table 2.
- Topological parameters of Interactome-1060*.
Table 2.
- Topological parameters of Interactome-1060*.
Number of nodes |
1060 |
Number of edges |
17493 ** |
Average node degree |
33 |
Avg. local clustering coefficient |
0.679 |
Expected number of edges |
8382 |
PPI enrichment p-value |
<1.0✕10‒16
|
Confidence score |
0.900 |
Source channels |
6 |
Network diameter |
10 |
Network radius |
5 |
Characteristic path-length |
3.717 |
Network heterogeneity |
1.187 |
Network density |
0.33 |
Network centralization |
0.189 |
Connected components |
1*** |
2.3.3. The Power Law of the Interactome-1060
However, before any topological consideration, it is necessary to find out what distribution the interactome degrees show. Biological networks show scale-free behavior with a few hub nodes controlling multiple connections within the system. The lack of an internal scale means that nodes with a large difference of degrees coexist in the same network. Barabasi discovered this feature in many biological networks [83,84], where the fraction of nodes with degree k follows a power-law distribution, showing that the degree distribution of the network is well approximated by Pk ~ k-ƴ, (ƴ > 1). The exponent ƴ is the degree exponent, and many properties of a scale-free network depend on the value of the degree exponent [83,84]. Calculating the degree distribution is an important part of analyzing the properties of a network [85]. Figure 2 shows that the interactome-1060 follows the characteristic distribution law of the nodes of a scale-free network.
Figure 2.
Power law distribution of interactome-1060. The distribution follows a free-scale distribution based on the power law. In the inset, we present the same nodes on a log–log scale, with the best fit of data (f(x) = 181.8x−0.98) shown in red. We calculated the slope (–0.517) on the best fit line in the log-log inset.
Figure 2.
Power law distribution of interactome-1060. The distribution follows a free-scale distribution based on the power law. In the inset, we present the same nodes on a log–log scale, with the best fit of data (f(x) = 181.8x−0.98) shown in red. We calculated the slope (–0.517) on the best fit line in the log-log inset.
To test the distribution, we fitted the function f(x) = axb, where the values of a, b (degree exponent), and R2 are 181.8, ‒0.98, and ‒0.272. Even though the interactome-1060 shows a significant p-value of 1.0✕10−16, the low correlation index of this fit underscores a strong expectation of heterogeneous associations among nodes, such as high clustering. The presence of clusters in the network topological architecture is useful for defining in a non-random way specific pharmacological attack points [86]. We note that nodes with high connectivity form a long tail (long-tailed distribution) and between degrees 30 and 70 there is a peak that characterizes an excessive number of nodes with these degrees compared to the fit. Some protein networks acknowledge the long tail 86he distribution as an intrinsic property rather than a byproduct of the specific algorithm used to compute the network [87]. This is also a characteristic property of scale-free networks that result in distributions with long tails where only the terminal nodes have high degree values [88].
In most real-world networks, new nodes prefer to connect to more connected nodes, according to a process called preferential attachment [89]. Therefore, the number of nodes grows because of the addition of new nodes, so growth and preferential attachment coexist. The power law should represent this tendency of the nodes. If we examine the degree distribution in the log-log graph (inset of fig.2) we find that the distribution deviates from a pure power law, which in logarithmic representation should follow a linear trend. The log-log distribution shows many overlapping linear plateaus in the high-k regime (the long-tail nodes) and a clear distortion in the low-K data. This suggests various subgraphs (molecular modules) each with its own specific hubs. [90].
According to the Barabasi's model [83,84], for b < 2, the exponent ƴ will be larger than one, as in our case. Hence, the number of connections to the largest hub will grow faster up to reach the global size of the network. This means that for very large N (total number of nodes) the highest degree hub could gain all nodes in the network. But this tendency slows down the connection speed [90,91] allowing also other nodes to increase their connectivity. Barabasi unraveled the complexity of this phenomenon, showing how a large scale-free network with b < 2 cannot exist without multi-link subgraphs [83,84]. Networks are not a static entity but grow by adding new nodes. The joint necessity of growth and preferential attachment generates scale-free networks and changing either of these factors will cause changes to the scale-free properties and network topology. Therefore, the growth rate of a hub node depends not only on its age, because other nodes can transform random transient interactions into a long-lived interaction. A common characteristic of these last nodes is an intrinsic property that we will call fitness [92,93]. Fitness is a property that favors the preferential attachment to other nodes by increasing the growth rate of their connectivity [94,95]. It is based on the set of distinctive structural and/or functional characteristics possessed by each node. On this basis, Barabasi has developed a specific model, the "Bianconi-Barabasi Model" or "Fitness Model" [92,96]. This model shows how nodes with different internal characteristics can gain links at different rates. It predicts that the growth rate of a node is determined by its fitness. One can measure the fitness by comparing the node with the temporal evolution of the fitness of other nodes in the network. This model presents a behavior of the nodes that is like that of Bose gas, studied by physicists [97].
This similarity explains very well the physical basis of forming the many independent and dense functional subgraphs observed in protein networks, characterized by their hubs. In fitness distributions, the network exhibits a "fit-get-rich" dynamic, meaning that the degree of each node is determined by its fitness, where new links not only arrive with new nodes but also occur between pre-existing nodes. The fitness model also shows that in many real systems, nodes, and links can change and disappear, explaining why nodes disconnect after enrichment and therefore need to be deleted [98].
Strictly speaking, if the linear preferential attachment governs the growing network, then a pure power law should emerge. However, it is rare to observe a pure power law in actual systems [99]. The Barabasi-Albert model is an idealized model that represents only the starting point for understanding the distribution degree in real networks [100] because fitness plays an important role. The concept of fitness in protein networks refers to the ability of a protein entity to survive and thrive within a protein network, because of its interaction with other proteins and its functional relevance. This plays an important role in the formation of those protein complexes that are crucial for a variety of biological processes. Protein fitness is based on many essential protein properties, such as secondary structure, solubility, binding affinity, flexibility, and functional specificity [101,102,103,104]. Therefore, in Interactomics, we can motivate this model only if there are experimental observations that explain the internal characteristics of the nodes. If we identify them correctly, then we can understand how fitness contributes to the formation of subgraphs and the topological evolution of the network [105]. In networks with many sub-graphs, such as the ones in this study, hubs connect to nodes of a small degree. As a result, we have a network that is unlikely to be represented by a single giant component. Networks in which hubs avoid connecting to each other but connect to many low-degree nodes are called disassortative and generate a hub organization with a hub-and-spoke pattern [106]. The logarithmic representation of the disassortative degree distribution is characteristic and very similar to the one calculated in figure 2. This means that our network (fig.1) has intrinsic troubles to be represented as a single giant component. We can appreciate this feature by measuring the slope of the linearization of its distribution. As we will see later, it is a measure of the speed of growth. The fit shows a negative slope (see figure 2 caption), and a lower probability of mutual interaction (y-axis) characterizes the nodes with the highest rank (x-axis) [107]. In conclusion, many statistically significant subgraphs give us the picture of the fundamental functions that the biological system performs and its dynamics. This allows us to understand with successful certainty the behavior of S1 in the system.
2.3.4. Origin of the Node Fitness in the Interactome-1060
A question now arises: which structural property dominates the fitness of the interactome-1060? Certainly, each single protein-node has its own specific intrinsic properties, but which of these is the predominant one? We have considered many characteristics of the nodes (protein length, secondary structure, flexibility, intrinsic disorder) but one of them stands out above all. Protein-protein interactions within a cell are dynamic events which do not occur concurrently and in the same location. When molecules came together, they form complexes, where structural disorder of motifs at the interface often mediates transient interactions. This creates transient multi-state complexes that are characterized by dynamic assortments of subunits. These diverse transient combinations also facilitate occurring different conformational states entropically driven [108]. Thus, the intrinsic disorder is the most critical features associated with transient interactions. Weakly interacting proteins show a fast dynamic bound-unbound equilibrium, which also includes interactions that are triggered by an effector molecule and stabilized by a conformational change. The 66% of signaling proteins involved in cellular functions with strong temporal variation of activity show a high probability of involving transient interactions [109]. Temporary interactions wield a significant influence in determining the hub behaviors [110]. Many hub-proteins with variable co-expression partners show transient binding at different times. While with high co-expression partners, they develop stable complexes [111]. This highlights the importance of the intrinsic structural disorder in protein-protein networks, but even that it underlies connectivity and controls how hub proteins interact. Figure 3 shows the average distribution of the intrinsic disorder existing in the interactome-1060 interactions.
Figure 3.
- The plot shows the average distribution of the protein disorder of the associated proteins in interactome-1060. Pearson's r value: ‒0.1; Pearson's p-value: 0.0015; BP-R2: 0.062 (medium). The grey light line is the median. STRING computed disorder content from sequences. The measure of BP-R2 is based on checking how much the values of the specified trend property deviate from the mean. Its scale (from 0 to 1) follows a quadratic pattern and does not have a confidence measure associated with the BP-R2. As a result, STRING has included some thresholds and the value of 0.15 is medium.
Figure 3.
- The plot shows the average distribution of the protein disorder of the associated proteins in interactome-1060. Pearson's r value: ‒0.1; Pearson's p-value: 0.0015; BP-R2: 0.062 (medium). The grey light line is the median. STRING computed disorder content from sequences. The measure of BP-R2 is based on checking how much the values of the specified trend property deviate from the mean. Its scale (from 0 to 1) follows a quadratic pattern and does not have a confidence measure associated with the BP-R2. As a result, STRING has included some thresholds and the value of 0.15 is medium.
The plot shows that intrinsically disordered proteins, which have a disorder percentage greater than 30% and play a crucial role in interactions, account for about half of the total interactions. The disorder content is very large, but almost all proteins show disorder. However, disordered segments can give interactions even shorten. All high-ranked nodes have a disorder content between 20 and 40% [112]. STRING used the binned pseudo-R-squared (BP-R2), a measure developed by Lun et al. to quantify complex signaling relationships between two variables, to assess the goodness of the fit [113]. The idea is to capture relationships that may not have a high Pearson or classification correlation, but that show associations, which are non-trivial. The range between 0 and 70 degrees shows a concentration of proteins with a high intrinsic disorder content. This range contains many of the proteins with high fitness potential. Disorder-enriched hub and non-hub nodes show a higher number of links, also because of the higher number of targeting, catalytic and many types of PTM sites [114]. In conclusion, we can say that the protein disorder is prevalent among these proteins. This is the structural feature that dominates fitness, driving the connectivity within the interactome-1060.
2.3.5. Centralities Based Analysis of the Interactome-1060
Topological analyses applied to protein networks show that some parameters such as Connectivity degree (k), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), and eccentricity are crucial parameters of nodes [115]. They are indicators of centrality because assigning rankings to nodes within the graph; they characterize the most important vertices. Each centrality measure assigns a centrality value to each node in a network and captures different aspects of what it means for a node to be important in that graph. High-ranking target search, identifying suitable nodes for characterization, is a critical step in annotating functional processes and understanding their molecular basis. The priority is to narrow down the most important nodes. After defining the broad interactome induced by S1 (interactome-1060) and calculating its properties, it is useful to identify hubs and bottlenecks [116].
We define as hubs the top 10% of the nodes in the high-confidence protein interactome based on their node degree (the number of interactions associated with a node). While we consider another top 10% of the nodes, ranked by betweenness centrality and closeness centrality (BCC), as bottlenecks. Betweenness centrality is an indicator of a node’s centrality in a network. It is equal to the number of the shortest paths from all vertices to all others that pass through that node. While Closeness centrality calculates the average distance of all the shortest paths between a node and every other node within a network. Thus, nodes with a high closeness score have the shortest distances to all other nodes. Betweenness and closeness are a way of detecting bottleneck nodes that can spread information through a graph, even if they do not always have very high degrees [117,118,119,120]. Eigenvector centrality of a graph is a measure of the influence of a node in a connected network. Relative scores are given to every node within the network, understanding that connections to nodes with higher scores have a more substantial influence on their own score than connections to nodes with lower scores. A high eigenvector score shows a node has connections to many nodes that themselves have a high score. The resulting information allows us to identify key nodes in terms of connectivity relevance in the interactome, thus suggesting very similar and super-imposable results to those of node degree [121,122]. We know that nodes with large k are central because they might correspond to disease-causing genes/proteins, whereas bottleneck nodes are vital since they serve as a crossroads in major signaling "highways" or overpass across these "highways". We focused on the hubs and bottlenecks that were central to our PPI network, identifying these key proteins and considering their sub-networks as the backbone of S1 induced topology [123,124].
The Figure 4 reveals the node distributions according to their centralities as calculated by Cytoscape. We report the protein names for comparing them with the results in Table 3.
Figure 4.
The figure shows the centrality distribution plots of interactome-1060: Closeness Centrality (top), Betweenness Centrality (middle) and Eigenvector (bottom). The Network Analyzer (version 4.5.0) on Cytoscape calculated the topological parameters to identify the node values [25,80,81]. The calculated distributions originate from the interactome.
Figure 4.
The figure shows the centrality distribution plots of interactome-1060: Closeness Centrality (top), Betweenness Centrality (middle) and Eigenvector (bottom). The Network Analyzer (version 4.5.0) on Cytoscape calculated the topological parameters to identify the node values [25,80,81]. The calculated distributions originate from the interactome.
We extracted the 26 nodes with the highest centrality values from each distribution. They represent the candidates for the analysis of the topological properties. Both Closeness and Betweenness select the nodes with the best features as bottlenecks. While the eigenvector distribution shows the protein nodes with the highest connectivity comparable to high-degree HUBs. We can find the 26 values for each centrality in the Excel file 1, sheet 3. In the comparisons between the various sets (Betweenness vs Closeness and Eigenvectors vs Degree), we selected only nodes in common between both compared sets. This resulted in fewer bottlenecks. But even though the total number of selected nodes is lower, we created very significant sets of HUBs and bottlenecks (Table 3).
Table 3.
High-ranked Hub and Bottleneck nodes of interactome-1060.
Table 3.
High-ranked Hub and Bottleneck nodes of interactome-1060.
HUB nodes |
Degree |
Bottleneck |
Degree |
RPS6 |
210 |
RPS27A |
235 |
RPS11 |
209 |
UBA52 |
213 |
RPS3A |
209 |
RACK1 |
149 |
RPS24 |
209 |
CD74 |
110 |
RPS9 |
209 |
MED1 |
107 |
RPS18 |
208 |
SRC |
101 |
RPS28 |
209 |
EEF1A1 |
88 |
RPS8 |
208 |
EGFR |
76 |
RPS19 |
207 |
ACTB |
65 |
RPS7 |
208 |
CD44 |
48 |
RPS23 |
200 |
STAT3 |
49 |
RPS16 |
190 |
CBL |
37 |
RPS3 |
174 |
|
|
RPS15 |
172 |
|
|
RPS5 |
189 |
|
|
FAU |
172 |
|
|
RPS13 |
184 |
|
|
RPS21 |
169 |
|
|
RPS17 |
169 |
|
|
RPS14 |
183 |
|
|
RPS27 |
181 |
|
|
We computed hub and bottleneck nodes based on relevant topological parameters, but there is no consensus in the literature on a defined threshold to identify how many nodes should be hubs or bottlenecks in a protein network, because the possible and used criteria are too numerous and sometimes arbitrary [124]. However, these nodes should represent the backbone of the basic connectivity that should favor a balanced architecture of the entire network with specific functional aspects. Therefore, we have gathered the selected HUB and Bottleneck nodes in a group of 33 nodes (3.1%). This group (Table 3), besides the topological characteristics, should also show evidence of reliable interactions, forcing the network into a hub and spoke organization. We expect that, for the topological functions they perform, they should closely connect to each other, because they all interact with the same dominant hub. In this regime hub and spoke configuration show characteristics of disassortative networks [66,67].
A look at the two plots in Figure 5 reveals that many of these high-ranked nodes (Hubs and Bottlenecks) concentrate in the nucleus and cytoplasmic system. This agrees with all activities related to viral translation. Both categories of nodes have significant biological importance, as they represent key connections and critical points of interaction. Bottleneck nodes possess a high centrality of intermediation, as they connect many parts of the network, influencing the information flow. The high scores between 4.5 and 5 justify these attributions.
Figure 5.
The graph shows the HUB-SPOKE pattern (left) generated by the selected nodes. The red color shows proteins involved in cytoplasmic translations (GO:0002181; strength: 2.07 and p-value: 6.81x10−41), while the blue one in gene expression (GO:0010467; strength: 0,89 and p-value: 4.00x10−18). STRING calculated the graph. Plots on the right show the distributions of nodes in the cellular compartments. These two classes of nodes operate in the cytosol and nucleus, some in both. Calculation performed by the Cytoscape.
Figure 5.
The graph shows the HUB-SPOKE pattern (left) generated by the selected nodes. The red color shows proteins involved in cytoplasmic translations (GO:0002181; strength: 2.07 and p-value: 6.81x10−41), while the blue one in gene expression (GO:0010467; strength: 0,89 and p-value: 4.00x10−18). STRING calculated the graph. Plots on the right show the distributions of nodes in the cellular compartments. These two classes of nodes operate in the cytosol and nucleus, some in both. Calculation performed by the Cytoscape.
The spoke-hub architectural pattern (for an operational explanation see:
https://cloud.google.com/architecture/deploy-hub-spoke-vpc-network-topology) expects the core system with high connectivity to comprise HUB nodes, while the well-connected bottleneck nodes are located outside. In our case (graph on the left), some bottleneck nodes (UBA52, MED1, RPS27A, RACK1, CD74, FAU, EF1A1) operate at the interface linking the remaining nodes (SRC, ACTB, STAT3, CBL, CD44, EGFR). In the graph, the colors identify the main overall functions they manage together. The red color shows cytoplasmic activities while the blue refers to overall nuclear ones. As we can note, all hub nodes operate in both compartments. Obviously, the graph highlights only the direct connectivity between these nodes because, in the complete network, each HUB, or Bottleneck node, manages many other “normal” nodes. White bottlenecks mediate many signaling activities. ACTB is involved in cytoskeletal control (GO: 0005925, focal adhesion, p-value 4.07✕10
‒21).
2.4.1. Justifications for a One-to-One Study
The principles governing immune responses operate at the organ-scale. Propagation of immune signaling between organs shows inter-organ mechanisms of protective immunity mediated by soluble and cellular factors [125] that transcend organ boundaries [126]. Cellular factors such as memory T cells can patrol organs and infected tissue [127,128]. Changes in tissue gene expression following vaccination have also highlighted immune processes that operate at the organ scale through a protective network [129]. Recent work shows that vaccination, like repeated infections, provides protection even in very distant tissues. [130,131]. Researchers have used this logic to study shared and tissue-specific expression patterns [132,133] and their correlation with disease [134,135]. All this suggests isolating the biological activities one by one, specific to the S1 protein. Both during infection and vaccination, both events have in common the encoded information (mRNA). Therefore, in both cases, the mRNA must use the same biosynthesizing nano-machines and the decoded protein, when acting alone, should take part in the same cellular processes present in the human host.
2.5.1. Reverse Engineering
Reverse engineering in biology applies an engineering concept, that of dismantling a process to understand it and discover the biological strategy. Thus, it is often used to discover the design principles of a biological system when the relationships between microscopic and higher-level processes are degenerate (many-to-many or one-to-many). It addresses the understanding of a complex system when the nonlinear relationships between the system's capabilities and its deep molecular mechanisms change. This suggests its usefulness in analyzing a complex functional system, faced with limited a priori knowledge of its "design principles" [136]. At first glance, "disassembling to reassemble" may seem like a reductionist approach to systems biology. However, data-intensive biological fields use reverse engineering approaches to recognize nonrandom connectivity patterns and identify functional capabilities of the overall network architecture [137]. This enables a topological analysis that abstracts from the context of network connectivity to identify functional capabilities. It is an approach to understanding how certain components are wired to create a functional whole. The search for these design principles allows us to know lower-level causal details and becomes robust when integrated with external experimental data tested in vivo and which can therefore biologically validate an interaction as real [138].
Our goal is to understand whether, in the same system but with changed organizational features, S1 performs similar operations. Although the molecular mechanisms and the specific functions that are activated depend on specific metabolic parameters, it may be possible to identify parameter spaces for which the same functions hold. In a broader discussion, we should consider that groups of viral proteins contribute to enhance the virulence of the virus by attacking single human proteins with multiple interactions, but some also do so alone. This should not be surprising because we have already observed it in the liver within the covid [10]. SARS-CoV-2 shows a broad tissue tropism, although of varying degrees, perhaps greater than what clinicians can appreciate through observation. This tropism is, however, the expression of the steps necessary to progress the viral infection, even in phenotypically different individuals and represents a strategic adaptation to the host. This means that, to be successful in replication, the action of the virus depends not only on its virulence, favored by Spike's interactivity but also on the ability to adapt the strategy of the viral proteome to the specific metabolic landscape it encounters in the host. Among the main variables that the virus encounters are age, sex, nutritional status, and previous individual pathologies.
In the excel file 3 we show the overall results of the reverse engineering analysis. We checked one by one, all 1060 nodes of the interactome in figure 1 against the 25,521 interactions collected in our SHPID database [10]. These interactions derive from individual proteins encoded by the SARS-CoV-2 proteome and those of the human proteome, as reported in BioGRID.
This file shows that there are many multiple interactions of S1 together with other viral proteins (sheet 3). We also found many viral proteins that interact individually in a one-to-one manner with specific human proteins (sheet 2). They could be pharmacological targets. While many human proteins are apparently not involved in any viral activities. This result confirms our previous observations on covid [10]. Probably, proteins not involved control metabolic processes that are beneficial for both the virus and the human host. The most interesting observation (sheet 2) is a set of 27 unique one-to-one interactions of S1 with human proteins (ACE2, AGTR1, AKT2, APOE, ASGR1, AVPR1B, C1QB, C1QC, CD46, CFH, CFP, CLEC4M, COP1, CR2, DPP4, ESR1, F10, FLT1, L12RB1, ITGB6, LYPLA2, MBL2, NID1, SDC1, SDC2, SNCA, TLR4). Through these proteins we will try to understand in which functional processes they involve, with which human proteins, and whether these interactions could represent a functional framework exclusive to S1, or to the infection.
The multiple interactions refer to the attacks conducted by groups of viral proteins, including S1, against single human proteins. The file (sheet 3) lists 148 human proteins attacked in this way. A careful observation shows that the number of viral proteins attacking single human proteins is often considerable. An example of all, the gene EIF2S1, Eukaryotic Translation Initiation Factor 2 Subunit Alpha, encodes the protein IF2A_Human. This is a protein of 315 aa, with a mixed alpha/beta structure. It is a member of the eIF2 complex that functions in the early stages of protein synthesis by forming a ternary complex with GTP and initiator tRNA. 19 viral proteins (nsp1, nsp3, nsp4, nsp5, nsp6, nsp13, nsp14, E, M, N, S, ORF10, ORF3a, ORF3b, ORF6, ORF7b, ORF7a, ORF8, ORF9b) attack this protein. Given its small size, like that of the ternary complex itself, it is impossible for there to be enough surface space for the interactions of 19 proteins concurrently. This means that the interactions, although all brief and momentary, occur at different times. However, without the time sequence, it is impossible to define the actual functional mechanism affected by these interactions (even without wanting to consider the “where”). On this basis, it once again seems useless to define an overall mechanism through chronologically undefined single interactions.
2.6.1. The Interactome-814
We used these twenty-seven proteins as functional seeds in the human proteome. Figure 6 shows the new interactome calculated by STRING.
Figure 6.
-Interactome of the 27 human proteins interacting one-to-one with S1. Number of nodes: 814; number of edges: 7409; average node degree: 15.9; avg. local clustering coefficient: 0.547; expected number of edges: 2285; PPI enrichment p-value: < 1.0x10‒16; 6 channels (without Text mining); confidence score 0.900; enrichment: 500 1st order + 500 2nd order proteins.
Figure 6.
-Interactome of the 27 human proteins interacting one-to-one with S1. Number of nodes: 814; number of edges: 7409; average node degree: 15.9; avg. local clustering coefficient: 0.547; expected number of edges: 2285; PPI enrichment p-value: < 1.0x10‒16; 6 channels (without Text mining); confidence score 0.900; enrichment: 500 1st order + 500 2nd order proteins.
This interactome (from now on “interactome-814”) comprises 814 nodes (Excel file 2, sheet 1). The first observation is that despite we added 1000 proteins for the enrichment, the system only accommodates 787 of them (814 - 27 = 787). This seems to reflect a low number of experimentally proven interactions. We can consider that STRING classifies only 21.46% of them as High or Highest (Excel file 2, sheet 2), which brings us back to the considerations made in Appendix A. Table 4 shows that this interactome too has a periphery rich in subgraphs but is on average less dense (0.22) and with a value of the average number of neighbours about 50% lower than the interactome-1060. Heterogeneity (1.042) suggests the tendency of this network to contain hub nodes, while the centralization value (0.138) still supports compactness even if the distance between two nodes (diameter) is lower but still high and supports the almost asymmetrical architecture we observe. In conclusion, we have an interactome with a global organization quite like the previous one, although smaller and less dense in terms of connectivity.
Table 4.
- Topological parameters of Interactome-814*.
Table 4.
- Topological parameters of Interactome-814*.
Number of nodes |
814 |
Number of edges |
7409 |
Average node degree |
15.9 |
Avg. local clustering coefficient |
0.547 |
Expected number of edges |
2285 |
PPI enrichment p-value |
<1.0✕10‒16
|
Confidence score |
0.900 |
Source channels |
6 |
Network diameter |
7 |
Network radius |
4 |
Characteristic path length |
3.189 |
Network heterogeneity |
1.042 |
Network density |
0.22 |
Network centralization |
0.138 |
Connected components |
1** |
Figure 7.
- Power law distribution of interactome-814. The distribution follows a free-scale distribution based on the power law. In the inset, we present the same nodes on a log–log scale, with the best fit of data shown in red. Slope is –0.374. We calculated the slope on the best fit line in the log-log inset.
Figure 7.
- Power law distribution of interactome-814. The distribution follows a free-scale distribution based on the power law. In the inset, we present the same nodes on a log–log scale, with the best fit of data shown in red. Slope is –0.374. We calculated the slope on the best fit line in the log-log inset.
Also, the interactome-814 shows a power law characteristic of scale-free networks (figure 7). Differently from the interactome-1060, the log-log distribution plot shows a fit with a good R2 value of 0.7528, so this log-log fit is the signature of a system well described by the power-law equation. Hence, interactome-814 should show a very balanced and linear overall growth, without distorting effects. Also, in this case, the exponent y is greater than 1, showing a central component that does not prevent peripheral modules. As for the two slopes, comparing them, they are both negative and not very different in value. However, the slope shows different growth rates, with the number of nodes increasing faster in 1060. Probably the two interactomes, although structurally similar, react differently to internal or external factors and this could be because of the presence of the greater heterogeneity found in 1060. Finally, all this suggests that, despite the considerable underlying biological complexity, the relationships between metabolic processes and population sizes of the interactomes seem to obey a relatively simple relationship, given by the power equation. This is a further fact that justifies the comparisons we are making on the two interactomes and indirectly also the search for the specific functional activities of the S1 protein.
In Figure 8, we show the centrality distributions of interactome-814. We reported the numerical values of the first 26 terms for each distribution in the Excel file 2, sheet 3. The same procedure adopted for interactome-1060 was used to assign the highest-ranking values. We can see in the betweenness centrality distribution that the upper range of the distribution is very wide, involving proteins with both a high degree and medium-low degree. What is striking is that some of them are also present in the centrality distribution of the eigenvectors. Since these are different topological properties, this, as we will see later, suggests mixed proteins (Hub/Bottleneck), a situation not present in the interactome-1060.
Figure 8.
The figure shows the centrality distribution plots of interactome-814: Closeness Centrality (top), Betweenness Centrality (middle) and Eigenvector (bottom). We calculated the topological parameters individually using the Network Analyzer (version 4.5.0) on Cytoscape to identify the node values [81].
Figure 8.
The figure shows the centrality distribution plots of interactome-814: Closeness Centrality (top), Betweenness Centrality (middle) and Eigenvector (bottom). We calculated the topological parameters individually using the Network Analyzer (version 4.5.0) on Cytoscape to identify the node values [81].
Table 5 reports the results showing the highest-ranking hub and bottleneck nodes. A comparison with Table 3 shows that although the architecture of the two interactomes may seem quite similar, the main proteins that underlie their structural and functional organization are totally different and behave differently. The nodes in Table 3, considered singly, perform only one activity, either as a hub or as a bottleneck. There are no mixed-activity nodes. We can consider those in Table 3 as pure hub and bottleneck nodes [139], while many of those in Table 5 show mixed activity.
Table 5.
High-ranked Hub and Bottleneck nodes of interactome-814.
Table 5.
High-ranked Hub and Bottleneck nodes of interactome-814.
HUB nodes |
Degree |
Bottleneck nodes |
Degree |
PIK3R1 |
121 |
AKT1 |
65 |
PIK3CA |
113 |
EGFR |
103 |
PIK3R2 |
114 |
ESR1 |
107 |
PIK3R3 |
113 |
MAPK1 |
113 |
PIK3CD |
108 |
MAPK3 |
130 |
PIK3CB |
108 |
PIK3CA |
129 |
SRC |
103 |
PIK3R1 |
128 |
AKT1 |
107 |
PRKACA |
112 |
MAPK1 |
112 |
PRKACB |
121 |
EGFR |
65 |
PRKACG |
109 |
MAPK3 |
109 |
PTK2 |
75 |
AKT3 |
73 |
RHOA |
49 |
AKT2 |
73 |
SRC |
65 |
ESR1 |
65 |
TP53 |
69 |
PLCG1 |
69 |
|
|
TP53 |
75 |
|
|
MAPK8 |
92 |
|
|
MAPK9 |
90 |
|
|
The functional coincidence between some hubs and bottlenecks in the interactome shows that these proteins not only cover many interactions but play a critical role in maintaining connectivity and stability in the network [140]. The coincidence also suggests that these proteins are fundamental for the functioning of the biological system and may represent key points for therapeutic interventions or functional analyses [139]. In fact, both categories of strategic positions in the network help to understand the robustness and vulnerability of the interactome, revealing potential regulatory mechanisms [141]. This allows us to consider seriously that the S1 subunit behaves differently when it interacts alone with the human proteome. To get a more reliable picture, we verified whether a hub-spoke scheme also exists in this case and what are the main allocations these proteins have in the cellular compartments. Figure 9 shows a hub-spoke scheme where, however, the central system is mixed because both pure hubs and bottlenecks man it.
Figure 9.
The Hub-Spoke organization of the Interactome-814. The lower central part of the graph (left) shows both pure Hub proteins and many mixed ones. The table on the right shows some of the most significant biological terms globally regulated by these high-ranking nodes. It is interesting to note how each of the nodes takes part in multiple biological functions simultaneously.
Figure 9.
The Hub-Spoke organization of the Interactome-814. The lower central part of the graph (left) shows both pure Hub proteins and many mixed ones. The table on the right shows some of the most significant biological terms globally regulated by these high-ranking nodes. It is interesting to note how each of the nodes takes part in multiple biological functions simultaneously.
The processes shown in the table are just some of the most relevant terms in which the nodes of the Hub/Spoke organization are globally involved. The graph shown is the structural backbone of the network. These activities, as well as many others not reported, support a deep involvement of S1 mainly in metabolic activities, even with worrying negative aspects (hsa05200 or HAS-199418). All processes show high Strength values, which suggest coordinated and active processes, and organizationally well-supported even at the gene expression level. It is worth noting that the graph includes 23 nodes among the highest-level ones and that 22 of them are involved in well-supported and statistically significant negative processes.
Figure 10 shows the four most significant distributions relative to the cellular compartments (cytosol and nucleus) and tissues (nervous system and blood) populated by the proteins of the interactome-814. The upper parts of the distributions exhibit dense population between the values 4 and 5. This shows a high functional activity of the proteins that populate it. The extent of involvement of high-ranking proteins can be determined by analyzing the distribution along the abscissa (degree). In this interactome, the cytosol, and nucleus stand out as the most involved and populated cellular compartments. However, the extracellular area and membrane level also exhibit intense metabolic activity (data not shown). Among the tissues, the nervous system is massively involved by proteins that include a large part of the high-ranking ones.
Figure 10.
Distribution of interactome-814 proteins in cellular compartments (nucleus, and cytosol) and tissues (blood, and nervous system). Calculations performed by Cytoscape.
Figure 10.
Distribution of interactome-814 proteins in cellular compartments (nucleus, and cytosol) and tissues (blood, and nervous system). Calculations performed by Cytoscape.
In summary, the two interactomes, despite their seemingly similar structure, perform distinct functions that are currently only broadly defined. It becomes important to focus on the functional activity to understand if, and how much, they differ from the point of view of metabolic purposes and, above all, which are the genes that oversee these processes. Are they the same genes or are they different genes? What is surprising is that interactome-814, despite having a lower total number of nodes than interactome-1060, controls 7,120 terms and 40% more functions based on Gene Ontology terms. (See Table 6).
Table 6.
Functional Processes activated in the human genome by the interactome-814.
Table 6.
Functional Processes activated in the human genome by the interactome-814.
Biological Process |
Terms significantly enriched |
|
Biological Process (Gene Ontology) |
2557 terms |
Molecular Function (Gene Ontology) |
321 terms |
Cellular Component (Gene Ontology) |
231 terms |
Reference publications (PubMed) |
>10,000 publications |
Local network cluster (STRING) |
246 clusters |
KEGG Pathways |
213 pathways |
Reactome Pathways |
828 pathways |
WikiPathways |
453 pathways |
Disease-gene associations (DISEASES) |
222 diseases |
Tissue expression (TISSUES) |
223 tissues |
Subcellular localization (COMPARTMENTS) |
218 compartments |
Human Phenotype (Monarch) |
1196 phenotypes |
Annotated Keywords (UniProt) |
124 keywords |
Protein Domains (Pfam) |
14 domains |
Protein Domains and Features (InterPro) |
222 domains |
Protein Domains (SMART) |
52 domains |
All enriched terms (without PubMed) |
7120 enriched terms |
In Barabasi-Albert network models, enrichment arises from a network growth process governed by the preferential attachment of nodes. The same protein can exert different functions by binding to different partners. A fundamental question is to understand how the opportunistic choices of individual nodes shape the properties of the global network. Precisely identifying these influential nodes is a challenging and still understudied task. We also consider that nodes are biological agents and links represent their functional interactions, which can also be modeled as cooperative activities. Nodes, taking part in an ever-increasing number of molecular processes, can change their local behavior or topology, maximizing their cooperative activity [142,143].
How does a protein select among a multitude of potential binding partners within a cell, expanding its functional repertoire? An adequate response should consider the location and translation rate of messenger RNA (mRNA), as both factors can cause spatial regulation of protein synthesis, affecting local protein concentrations and interactions [144,145]. The rate of translation elongation can indeed influence protein folding and, subsequently, its interactions with other proteins [146,147]. Slow translation can allow more time for co-translational folding and interaction with certain partners, while rapid translation might favor interactions with different proteins or lead to misfolding [148,149]. However, additional considerations can also come from other types of comparisons of the two interactomes.
2.7.1. Data Merging
The two interactomes, 1060 and 814, although induced by the same viral protein, appear to operate in different metabolic contexts. Characterizing the behavior of these two networks is essential to understand the complexity of S1 action [150]. The differences appear clear if we compare the set of GO processes controlling each interactome. The enrichment of interactome-814 showed 7,120 terms in 15 categories. While the interactome-1060 shows 4,989 terms in 15 categories. The difference in terms is 1.42-fold, but for ontological terms is 40% and the three Ontologies reflect the functions. Size and reliability of the datasets under study, the scientific design, and phenotype specificity affect the identification of critical nodes and functional processes in any system. We standardized these variables by making the methodological procedures as similar as possible and, most importantly, using only experimental data and selecting only those with the highest reliability. We considered the topological properties of nodes and evaluated their functional roles based on their ability to transmit information within and between modules in the network. Using Gene Ontology for genomic functional annotation is crucial, as it can reveal important biological information. Gene Ontology (GO) is divided into three aspects: molecular function, cellular component, and biological process [151]. But it has redundancy problems when analyzing them together, especially because of gene overlaps. The redundancy in GO annotations can complicate the interpretation of biological data [152]. Therefore, the analysis of a single ontology, such as Biological Processes, which are also the most many and all-encompassing, can be a useful strategy to limit the redundancy and improve the clarity and significance of the results [153,154].
By comparing the Biological Processes (GO) of the two interactomes, we still highlighted the large functional differences already noted. 2,557 processes for the interactome-814 versus 1,430 processes for the interactome-1060, is 44.1% more. A closer look at the two interactomes (see Excel file 1, sheet 4 and Excel file 2, sheet 4) shows that many functions are similar, while others appear specific to each of them. The same happens for many of the nodes involved. In fact, some of them appear many times in different biological processes associated with the same interactome. All this suggests an important and central role of these genes in regulating some cellular functions related to covid [155], but it also raises questions we cannot yet fully answer today. For example, if the same gene appears in dozens of different biological processes, does it do so simultaneously or over a long-time horizon? The analysis of cellular systems requires the coordination of large numbers of events, but identifying the temporal cues underlying interactions is the critical part of understanding cellular functions. With current knowledge, we could have a variety of interpretations, many probably distorted [156]. This has led us to investigate the overall behavior of biological processes, rather than wanting to find the gold-process at all costs.
The presence of multiple interactions within the interactome shows a complex network of gene regulation, in which some genes can directly or indirectly influence a myriad of biological processes. But, when we say many genes and a "myriad of biological processes", we need to know what we are talking about in quantitative terms. To our knowledge, no study related to SARS-CoV-2 has ever made such an assessment. To understand the similarity and dissimilarity of functions and the genes that support them, we used an analysis borrowed from marketing methods to compare the two data sets represented by the Biological Processes (GO). We compared the two interactomes through a Data Merging (details in Methods), combining the two large biological data sets into one (see Excel file 4, sheets 1 and 2). Data Merging is used to evaluate interactions parameters, append observations, and find repetitions. Therefore, the logic we used was that to distinguish common processes (coupled processes) from single processes (uncoupled processes) of each interactome. The purpose of merging is to optimize collecting all the data into a single set, maximizing the completeness with which critical information can be extracted and analyzed. Table 7 illustrates the general picture that emerges from the merging of the two data sets, both containing common processes, but also specific to one or the other data set.
Table 7.
- Data Merging between Biological Processes (GO) of interactomes 1060 and 814.
Table 7.
- Data Merging between Biological Processes (GO) of interactomes 1060 and 814.
|
Number of Biological Processes (GO) (%) |
Redundant genes (%) *
|
Coding genes |
Average genes per single process. |
Genes found >100 times |
Merging of 1060+814 (after pruning) **
|
2837 (total)
|
68,003 (total)
|
--- |
23.97 |
----- |
Coupled processes in the merging of 1060+814 |
554 (39) ***
|
24,301 (35.8)
|
944 |
21.9 |
ABL1, AGT, AKT1, APOE, BCL2, BTK, CD28, EGFR, FYN, HLA, HRAS, IL12A, IL12B, IL12RB1, IL23A, JAK2, KDR, KIT, LYN, MAPK1, RHOA, SRC, SYK, THBS1, TICAM1, TLR4, TNF, TYK2, ZAP70. |
Uncoupled processes in 814 |
1515 (53) |
39,691 (58.3) |
771 |
26.19 |
ADA, ADCY8, ADRA1A, ADRA2A, AGT, AGTR2, AKT1, AKT2, APOE, APP, AR, ASPH, ATF2, ATF4, ATP2B4, AVP, AVPR, BAD, BAK1, BAX, BCL2, CALM1, CTNNB1, CYBA, DLG1, EDNRA, EGFR, EP300, FOS, FOXO1, FOXO3, FYN, GNAI2, GSK3A, GSK3B, HIF1A, HSP90AA1, HSP90AB1, HSPA5, IGF1R, IL12B, IL2, INSR, IRAK1, ITGB1, JAK2, JUN, KCNE1, KCNQ1, KDR, KIT, LYN, MALT1, MAP2K1, MAPK1, MAPK14, MAPK3, MAPK8, MED1, MMP9, MTOR, MYD88, NFKB1, NKX3-1, NOS1, PODGFRA, PIK3CA, PIK3CG, PLCG2, PPARA, PPARG, PPP3CA, PRKCD, PTEN, PTK2B, PTPN2, RELA, RHOA, RIPK1, RIPK2, RACK1, RPTOR, SLC8A1, SMAD3, SNCA, SRC, STAT3, SYK, TGFB1, THBS1, TIRAP, TLR2, TLR4, TNF, TP53, |
Uncoupled processes in 1060 |
214 (8) |
4,011 (5.9) |
701 |
18.74 |
Family EIF, Eukaryotic initiation factors gene family, (230), histones (295), family NDUF (352), family RPL (516), family RPS (411). **** |
Table 7 shows how the data-merging reveals thousands of genes with widespread gene redundancy, but also many uncoupled processes. These results specifically show that the activities exerted by the S1 subunit alone in its one-to-one relationships (in 814) have a relevant functional incidence (53%). However, the large number of highly represented genes in the same processes also means that multiple genes will have to appear multiple times in biological processes associated with the same interactome. An average value of over twenty genes per process shows how difficult it is to single out a single signaling pathway, or even a metabolic process, and assign genes to it.
The observed differences in gene composition suggest that gene expression and its involvement can vary significantly depending on the specific context, such as different tissue types, conditions, or stages of development. This can cause different genes being highlighted, even at different times, within the same larger biological process. We should not overlook the perception of the many ways in which 68,003 genes can be organized into 2,837 different processes. In fact, an extremely high number of processes can be assigned to about twenty genes. The overall number of processes is 232837, while S1 specifically comprises 261515. This is an astronomical number of combinations, which makes it clear why adequate and correct experimental data, and their control, are necessary to reduce the combinations to a few when studying specific functional processes in any design context. As an illustration, when examining IL12A, involved in coupled processes, or RACK1, involved in uncoupled processes of 814, (Table 7), they exert a wide range of biological functions, so many that each of them is involved in over 100 processes. Therefore, how can we confidently ascribe the precise biological pathway in which each of these proteins takes part, considering their abundance of over 100 occurrences within the interactomes under investigation?
Studies on HeLa cells have revealed that protein expression levels exceeding 90% are consistent with the average level of protein expression [157]. This shows that there is ample evidence to support an excess of protein copies, even at the level of gene expression, encompassing a significant portion of transcripts that encode functional proteins. But this ensures the efficient functioning of the processes in which these proteins are involved [157,158,159]. Protein abundance can be determined by many factors, such as transcription, translation, or RNA/protein decay [160]. Therefore, these factors can combine to produce a certain expression value. The load balance between transcription and translation regulates gene expression necessary to optimize cellular fitness [161]. Low expression of essential proteins slows growth [162], but even generalized overexpression of proteins slows growth because it increases metabolic load [163] and energetic costs. Today, we can only say that the implications of over-representations of genes in an interactome can be multiple and each hypothesis influences the understanding of disease and cellular interactions [164]. Correct regulation of genes in space is necessary for proper function.
These claims may raise many questions, but there is no clear evidence to support any hypothesis or claims made about this matter. Despite technological advances in high-throughput sequencing, our ability to draw functional conclusions from expression data is lagging and mostly qualitative [165,166,167]. The cell organizes its biochemistry in space by forming distinct chemical compartments in which membranes are separating barriers. Achieving the ability to differentiate the functions of cells within a multicellular tissue requires standardizing spatial transcriptomics data and correlating it with cellular mappings using bioinformatics systems. This will enable the identification of various subpopulations with their distinct transcriptional profiles [168,169]. In addition, when we evaluate protein-protein interactions present in an interactome, we realize that, despite the integrations between different sources, they are far from complete in experimental terms [170,171]. This can lead to gaps in real physical characterization and certainty of the interaction that is reflected in distortions of functional knowledge in GO processes. While superimposition between gene sets can cause low specificity in over-representation analysis, affecting the results and conclusions. Thus, over-representation (also called enrichment analysis) in genomic analysis plays a crucial role in several aspects. It works by identifying pathways or gene/protein sets that have a higher overlap with a known gene/protein set of functional interest than expected by chance. For example, it helps identify significant biological pathways associated with certain conditions or diseases by revealing how overrepresented genes/proteins interconnect. The interconnectivity of genes, i.e., their membership in functional communities, enables us to unravel complex biological mechanisms that we cannot resolve by analyzing some individual processes or signaling pathways. In summary, overrepresentation is fundamental to interpret genomic data but when these are overabundant, complex, with high protein redundancies, as we find them here, it may be more appropriate to identify sets of genes that are highly interconnected and that exert specific functional activities in common. This way, we should have a more precise vision of the functional strategies in an interactome. Therefore, we eliminated redundancies from the three gene sets by isolating the single copy of each coding gene. We got three sets of coding genes: 944 genes for the coupled processes of the interactomes 1060+814, 689 for the uncoupled processes of the interactome-1060, and 771 for the uncoupled processes of the interactome-814. We performed a clustering analysis of each of the three sets of their decoded products (Excel file 5, sheets 1, 2 and 3). The sets encompass proteins related to common and interconnected functional processes (1060+814), proteins specifically involved in the one-to-one activity of S1 (814), and proteins derived from the 1060-interactome that do not fall into the sets.
2.8.1. Clustering Analysis
We conducted this analysis on the three sets of coding genes to get an overall picture of the activities exerted by each set. The Excel file 5 also reports the three sets of genes involved. Table 8 shows the overall results.
Table 8.
Clustering Analysis of coding genes from Data-merging.
Table 8.
Clustering Analysis of coding genes from Data-merging.
1-CLUSTERS OF UNCOUPLED FUNCTIONS OF INTERACTOME-1060 |
Cluster No. |
Primary description |
GO-term |
p-value |
Gene count * |
1 |
Cytoplasmic translation |
GO:0002181 |
4.83 × 10−83
|
266 |
2 |
Focal adhesion |
GO:0005925 |
7.61 × 10−48
|
189 |
3 |
Aerobic electron transport chain |
GO:0019646 |
1.49 × 10−47
|
75 |
4 |
DNA replication-dependent chromatin assembly |
GO:0006335 |
6.67 × 10−19
|
44 |
5 |
Antigen processing and presentation |
GO:0019882 |
6.67 × 10−16
|
33 |
6 |
Complement activation, classical pathway |
GO:0006958 |
1.67 × 10−11
|
23 |
7 |
COPII vesicle coat |
GO:0030127 |
2.46 × 10−12
|
20 |
8 |
Activation of phospholipase C activity |
GO:0007202 |
3.30 × 10−06
|
18 |
9 |
COPI vesicle coat |
GO:0030126 |
1.90 × 10−09
|
11 |
10 |
Cholesterol metabolism |
hsa04979 |
2.70 × 10−04
|
10 |
Cluster No. |
Secondary description |
GO-term |
p-value |
Gene count |
1 |
Formation of a pool of free 40S subunits |
HAS-72689 |
7.09 × 10−91
|
- |
3 |
Respiratory chain complex |
GO:0098803 |
7.29 × 10−52
|
- |
4 |
CENP-A containing nucleosome |
GO:0043505 |
5.51 × 10−15
|
- |
6 |
Complement and coagulation cascades |
hsa04610 |
4.06 × 10−09
|
- |
8 |
G alpha (q) signaling events |
HAS-418597 |
1.11 × 10−03
|
- |
10 |
Plasma lipoprotein particle clearance |
GO:0034381 |
5.60 × 10−03
|
- |
Cluster No. |
Tertiary description |
GO-term |
p-value |
Gene count |
1 |
Ribosome |
GO:0005848 |
2.08 × 10−79
|
- |
________________________________________________________________________________________________________ |
2—CLUSTERS OF COUPLED FUNCTIONS OF INTERACTOMES-1060+814 |
Cluster No. |
Primary description |
GO-term |
p-value |
Gene count |
1 |
Positive regulation of transferase activity |
GO:0051347 |
2.76 × 10−63
|
409 |
2 |
Focal adhesion |
GO:0005925 |
5.66 × 10−44
|
232 |
3 |
ECM-receptor interaction |
hsa04512 |
9.88 × 10−36
|
79 |
4 |
Long-term potentiation |
HAS-9620244 |
7.01 × 10−06
|
54 |
5 |
Rho protein signal transduction |
GO:00072666 |
9.12 × 10−08
|
43 |
6 |
Formation of Fibrin Clot (Clotting Cascade) |
CL:18784 |
1.09 × 10−06
|
37 |
7 |
Antigen processing and presentation |
GO:0019882 |
7.05 × 10−13
|
35 |
8 |
Complement activation |
GO:006956 |
1.33 × 10−18
|
33 |
9 |
Cholesterol metabolism |
hsa04979 |
1.80 × 10−03
|
13 |
10 |
Renin-angiotensin system |
hsa4614 |
2.09 × 10−03
|
9 |
Cluster No. |
Secondary description |
GO-term |
p-value |
Gene count |
1 |
Cellular responses to stress |
|
7.56 × 10−11
|
- |
2 |
Mixed, incl. Constitutive Signaling by Aberrant PI3K in Cancer, and FCERI mediated Ca+2 mobilization |
CL:17328 |
2.28 × 10−34
|
- |
3 |
Protein complex involved in cell adhesion |
GOCC:0098636 |
1.09 × 10−27
|
- |
4 |
Calmodulin-binding |
KW.0112 |
5.05 × 10−15
|
- |
5 |
G alpha (12/13) signaling events |
HAS-416482 |
1.69× 10−09
|
- |
6 |
Blood coagulation |
GO:0007596 |
7.29 × 10−24
|
- |
9 |
Regulation of plasma lipoprotein particle levels |
GO:0097006 |
2.16 × 10−05
|
- |
Cluster No. |
Tertiary description |
GO-term |
p-value |
Gene count |
1 |
Protein kinase binding |
GO:0019901 |
6.94 × 10−74
|
- |
5 |
Mixed, incl. Sema4D in semaphorin signaling, and ARHGEF1-like, PH domain. |
CL:17973 |
1.765× 10−06
|
- |
9 |
Protein-lipid complex |
GO:0032994 |
6946 × 10−74
|
- |
|
|
|
|
|
3-CLUSTERS OF UNCOUPLED FUNCTIONS OF INTERACTOME-814 |
Cluster No. |
Primary description |
GO-term |
p-value |
Gene count |
1 |
Hepatitis B |
hsa055161 |
4.98 × 10−73
|
259 |
2 |
mTOR signaling pathway |
hsa04150 |
2.05 × 10−36
|
139 |
3 |
Fc gamma R-mediated phagocytosis |
hsa04555 |
6.62 × 10−32
|
113 |
4 |
Long-term depression |
hsa04730 |
1.72 × 10−29
|
72 |
5 |
Blood vessels diameter maintenance |
GO:0097746 |
3.61 × 10−13
|
61 |
6 |
ECM-receptor interaction |
hsa04512 |
9.96 × 10−24
|
56 |
7 |
Complement activation |
GO:0006956 |
3.73 × 10−18
|
32 |
8 |
Renin-angiotensin system |
hsa04614 |
1.10 × 10−04
|
14 |
9 |
Glycerophospholipid metabolism |
Hsa00564 |
2.71 × 10−08
|
13 |
10 |
Plasma lipoprotein particle remodeling |
GO:0034369 |
9.94 × 10−05
|
12 |
Cluster No. |
Secondary description |
GO-term |
p-value |
Gene count |
3 |
Constitutive Signaling by Aberrant PI3K in Cancer |
|
1.12 × 10−33
|
- |
4 |
Calmodulin binding |
GO:0005516 |
1.11 × 10−21
|
- |
5 |
Mixed, incl. Heterotrimeric G-protein complex, and Signaling transduction inhibitor |
CL24307 |
6.90 × 10−13
|
- |
6 |
Cell adhesion mediated by integrin |
GO:0033627 |
2.32 × 10−14
|
- |
7 |
Initial triggering of complement |
HAS-166663 |
5.26 × 10−12
|
- |
8 |
Dipeptidyl-peptidase activity, and Meprin A complex |
CL31769 |
6.08 × 10−03
|
- |
10 |
Cholesterol metabolism |
hsa04979 |
1.98 × 10−49
|
|
Cluster No. |
Tertiary description |
GO-term |
p-value |
Gene count |
3 |
GPVI-mediated activation cascade, and SH2 domain superfamily |
CL:17470 |
1.53 × 10−27
|
- |
5 |
Vascular smooth muscle contraction |
hsa04270 |
8.57 × 10−39
|
- |
6 |
Integrin |
KW-0401 |
1.88 × 10−16
|
- |
10 |
Protein-lipid complex |
GO:0032994 |
5.40 × 10−03
|
- |
Although we also reported data on uncoupled functions of the interactome-1060 and those coupled via the merging protocol, our analysis currently focuses only on one-to-one interactions of S1, but we will discuss uncorrelated data later to provide a broader perspective.
The list of biological processes and pathways provided by the clustering analysis (Table 8) reflects a complex interplay of metabolic activities influenced by the one-to-one interaction of the S1 subunit of the SARS-CoV-2 Spike protein. Many of these activities are central to the body's response to infection, immune regulation, and cell signaling, and can be disrupted during both viral infection and vaccination. No one can rule this out. The clustering results cover broad macroscopic areas of activity: 1. Immune system activation and regulation; 2. Vascular and cardiovascular implications; 3. Metabolic processes; 4. Cell signaling and structural integrity; and 5. Neural and cognitive processes.
2.8.1.1. The liver aspects
The appearance of the Hepatitis B pathway (hsa055161) is unexpected in SARS-CoV-2. The liver is one of the most affected organs by covid, and the increase in liver enzymes is the most common symptom [172]. There appears to be a correlation between the severity of the disease and older patients with other morbidities. Chronic HBV infection can lead to metabolic syndrome and liver dysfunction [173]. The pathways involved in liver metabolism may intersect with systemic responses to SARS-CoV-2, especially in patients with pre-existing liver conditions. Metabolic dysfunction (MASH) is common in Western countries and proceeds through a slow progression of inflammation and fibrosis, which is associated with the imbalance of lipid metabolism and insulin resistance, all components also common to COVID-19. This can certainly exacerbate liver effects in chronic patients. The virus mainly infects cholangiocytes with elevated levels of IL1, TNFA, MCP1, all potential factors that can induce the development of MASH with progression to advanced chronic states, but we cannot exclude possible cancerous states [174]. So, the appearance of the Hepatitis B path may come from shared immune mechanisms or pathways between SARS-CoV-2 and Hepatitis B virus (HBV), possibly relating to immune evasion strategies or overlapping receptor usage.
We could consider potential explanations: A) Cross-reactivity: The immune response triggered by the S1 subunit may have shared components or epitopes with the Hepatitis B virus, resulting in a cross-reactive immune response that affects Hepatitis B-related pathways as well [175]. B) It is conceivable that this could be a statistical anomaly or data noise, suggesting an indirect association not directly caused by the S1 subunit, but reflecting shared cellular machinery or immune pathways. C) Certain signaling pathways that are activated during viral infections, such as mTOR or immune-related pathways, play a role in the response to different viral infections, including Hepatitis B, resulting in concurrent pathway activation [176,177]. Both viruses induce strong inflammatory responses, resulting in the activation of similar pathways in host cells and the initiation of shared biological processes. The activation of signaling pathways, such as the mTOR pathway, in response to viral infection, serves as an illustrative example of a common theme. D) The metabolic alterations induced in host cells by both viruses facilitate the promotion of viral replication. As an example, HBV modifies lipid metabolism, a potential pathway that SARS-CoV-2 also affects via its protein. The overlapping biological processes highlight the intricate interplay between viral infections, immune responses, and cell metabolism, even though the S1 subunit of SARS-CoV-2 and hepatitis B might not explicitly link to direct metabolic activities. Understanding these connections can help explain the broader implications of viral infections on host health and the development of vaccines. Further research would be essential to clarify the specifics of these relationships. However, considering the results of the data-merging analysis, all this seems to be the effect of the huge number of genes involved, and the possibility of innumerable interactions with the same groups of overlapping molecules, to which we must add the scarcity of experimental data, all factors that can mislead even advanced computing systems.
2.8.1.2. Vascular aspects
The “Renin-angiotensin system” is crucial for blood pressure regulation and fluid balance [178], and its involvement may explain some of the cardiovascular manifestations seen in COVID-19. In chronic liver disease, alterations in this system can exacerbate portal hypertension and fluid retention. MTOR, the Renin-angiotensin system, “Blood vessels diameter maintenance” and “Vascular smooth muscle connection” are closely connected and regulated through the calcium signaling path by “Calmodulin Binding”. This latter influences also the action of both “FC gamma R-mediated phagocytosis” and “Cytoskeleton regulation”, driven by “Integrin” and “Integrin mediated cell adhesion” [179]. Immune cells, such as macrophages, use calcium signaling [180] to engulf and eliminate infected cells, including those who would be affected by hepatitis B. Integrins mediate cell-cell and cell-ECM interactions, influencing cell migration and signaling [181]. In liver injury, integrin signaling can affect hepatocyte survival and regeneration [181,182]. During infection or immune response, disruptions in lipid metabolism, such as Glycerophospholipid Metabolism (hsa00564) and Cholesterol Metabolism, may occur, potentially causing alterations in lipid profiles and contributing to the hypercoagulable states observed in COVID-19.
2.8.1.3. Cumulative effects may originate cancerous involvement
Certainly, the cumulative effects of chronic inflammation, metabolic dysfunction, and abnormal signaling pathways can increase the risk of hepatocellular carcinoma in individuals with underlying liver disease. However, we should also consider that these cumulative effects may promote oncogenesis. In this context, the mTOR pathway is crucial in regulating cell growth, proliferation, and survival, but it was often associated with cancer [183]. Molecular changes, such as mutations in oncogenes and tumor suppressor genes, can further drive any cancer development. Even in the other two clustering analyses, we can see connections with possible cancer progression, especially in terms of genomic instability, increased proliferation, immune evasion, and metastasis. Dysregulated kinase, such as PI3K and ECM-receptor, may support cancer progression [184,185]. Ribosome biogenesis and chromatin assembly might also lead to uncontrolled cell growth [186,187]. Targeted projects are necessary in these areas to get concrete answers and reveal significant signals of cancerous evolutions. Here, we only show that cancer evolution could be possible because the specific processes are active and in common.
2.8.1.4. Neural effects
The neural and cognitive processes are another important point. LTD, “Long-term Depression” and “Calmodulin Binding” are involved in neural signaling and plasticity [188], suggesting potential effects on neurological and/or cognitive functions, which aligns with reports of neurological complications observed in COVID-19 patients [189,190,191]. The action of both “FC gamma R-mediated phagocytosis” and “Cytoskeleton regulation” [192], driven by “Integrin” and “Integrin mediated cell adhesion” can affect LTD [193]. We have found LTD (hsa04730, p-value: 7.21x10‒92) connected to “LTP, Long-term Potentiation” (hsa04720, p-value: 2.16 x 10-32) with many genes in common (see Figure 11).
Figure 11.
Relationships between LTP and LTD processes. These two molecular processes show many nodes in common. LTP (in red), Long-term potentiation (hsa04720), 18 nodes involved, strength 2,16 and p-value: 3.50x10-32. LTD (in blue), Long-term depression (hsa04730), 39 nodes involved, strength 2.52, and p-value: 7.21x10‒92. LTD is a process involving a decrease in the synaptic strength with multiple signal transduction pathways involved. While LTP is long-lasting increase in synaptic efficacy. The high Strength values show that many proteins physiologically support the involvement of multiple signal transduction pathways in both processes.
Figure 11.
Relationships between LTP and LTD processes. These two molecular processes show many nodes in common. LTP (in red), Long-term potentiation (hsa04720), 18 nodes involved, strength 2,16 and p-value: 3.50x10-32. LTD (in blue), Long-term depression (hsa04730), 39 nodes involved, strength 2.52, and p-value: 7.21x10‒92. LTD is a process involving a decrease in the synaptic strength with multiple signal transduction pathways involved. While LTP is long-lasting increase in synaptic efficacy. The high Strength values show that many proteins physiologically support the involvement of multiple signal transduction pathways in both processes.
The brain’s actions involve the participation and connection of both molecular processes. But there is also a potential link between these neurological processes and the S1 protein that could provide some clues about the molecular basis of the neurological impact of covid. PRKCG, MAPK1, BRAF, KRAS, and ITPR1 are part of key signaling pathways like the MAPK/ERK pathway, which is often linked to cellular stress responses, inflammation, and apoptosis. Various COVID-19-related pathologies, particularly those affecting the immune response and inflammation in different tissues, including the brain, implicate them especially in the MAPK/ERK pathway, which is often linked to cellular stress responses, inflammation, and apoptosis [194,195]. While NOS1 (nitric oxide synthase 1) is involved in producing nitric oxide, a molecule with widespread roles in neurotransmission and vasodilation. Researchers have implicated dysregulation of nitric oxide in COVID-19, particularly in relation to endothelial dysfunction, which can also affect the brain [196]. Genes of phosphoprotein phosphatase family, like PPP2CA, PPP2CB, and PRKG1 are involved in signaling cascades related to protein phosphorylation, that finely tune platelet aggregation [197] but their dysregulation could contribute to the virus's ability to manipulate cellular environments [198]. In addition, these genes have connections to inflammation, oxidative stress, and synaptic plasticity. Processes that are frequently modified during viral infections and potentially contribute to the long-term neurological consequences of COVID-19. PRKCG (Protein kinase C gamma) and MAPK1 have been shown to modulate insulin signaling and glucose uptake [199], also in the brain [200]. Disruptions in insulin and glucose metabolism pathways could contribute to neurological symptoms, including brain fog and fatigue reported in Long COVID, as these pathways closely tie to cognitive function. However, a very pertinent observation is the altered glucose metabolism in the brain reported by two French research groups [201,202]. In terms of glucose metabolism, genes like RAF1 and MAPK1 regulate metabolic homeostasis, including their effects on glucose uptake and insulin sensitivity.
Genes like GNAI2, GNAI3 (G-protein subunits) are part of G-protein-coupled receptor (GPCR) signaling pathways, which are directly involved in neurotransmitter systems, including serotonin signaling [203]. Serotonin regulation in the brain is crucial for mood, cognition, and overall neurological function. Dysregulation in this pathway could contribute to both mood disorders and cognitive symptoms seen in Long COVID patients.
To our knowledge, these results show the first molecular evidence that COVID-19 may affect brain metabolism, because of the involvement of these genes in critical brain functions, synaptic plasticity, and metabolic pathways. They potentially contribute to neuroinflammation states and energy dysregulation, affecting cognitive performance. However, further experimental and computational work should merge these links to reveal new therapeutic targets.