1. Introduction
Biological macromolecules (proteins, DNA, and RNA) perform their functions by adopting a particular 3D structure and participating in a set of interactions. For many proteins, excluding intrinsically disordered proteins (IDPs), the correctly folded 3D structure is needed to protect them from protease degradation and to form the required catalytic residues, binding interfaces, and other functionally important structural features [1,2]. The stability of such a 3D structure is assessed via a thermodynamic quantity called the folding free energy, i.e., the difference between the free energies of the folded and unfolded states (ΔGfolding). Another important process is the binding of biological macromolecules, in which they form particular 3D complexes, including the cases of IDPs that adopt a well-defined 3D structure upon binding [3,4]. Similarly, the ability of macromolecules to form a complex is assessed via the binding free energy (ΔGbinding), i.e., the difference between the free energies of the bound and unbound states. Because of their importance for the biological function of macromolecules, ΔGfolding and ΔGbinding have been extensively investigated experimentally, and many methods for predicting them have been developed [5,6,7,8,9,10].
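The two definitions above can be written compactly as shown below. The first two lines restate the text; the third line, the mutation-induced change, uses a mutant-minus-wild-type sign convention that is an assumption made here for illustration only, since the opposite convention is also found in the literature.

```latex
\begin{align}
\Delta G_{\text{folding}} &= G_{\text{folded}} - G_{\text{unfolded}} \\
\Delta G_{\text{binding}} &= G_{\text{bound}} - G_{\text{unbound}} \\
% Assumed sign convention (not fixed by the text): mutant minus wild type
\Delta\Delta G &= \Delta G^{\text{mutant}} - \Delta G^{\text{wild type}}
\end{align}
```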
The vast majority of experimental work has assessed the impact of a given residue on either ΔGfolding or ΔGbinding by substituting the wild-type residue with alanine (alanine scanning) [11,12,13]. This raises the question of the balance between investigator-initiated mutations and mutations seen in nature, i.e., in the human population, which are single nucleotide variants (SNVs). It should be mentioned that mutations and SNVs are both types of genetic variation that can occur in the DNA sequence (this article focuses on missense mutations, i.e., mutations that result in a change of the amino acid sequence of the corresponding protein). However, there is a subtle difference between the two terms: a mutation is a broader term that refers to any change in the DNA sequence relative to the wild type or the reference sequence, while an SNV is a specific type of mutation that involves the substitution of a single nucleotide (A, T, C, or G) at a specific position in the DNA sequence. Thus, SNVs are a type of mutation, but not all mutations are SNVs. In this article, we provide an assessment of the distribution of SNVs and non-SNVs and the corresponding free energy changes reported in the most popular databases.
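The distinction between SNVs and non-SNVs can be illustrated at the amino acid level with a minimal sketch. It assumes that a missense substitution counts as an SNV if at least one codon of the wild-type residue can be turned into a codon of the mutant residue by a single nucleotide change in the standard genetic code. This criterion and the function name `is_snv_accessible` are illustrative assumptions, not necessarily the procedure used to label the datasets discussed in this article.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode ("*" marks stop codons).
CODONS_OF = {}
for codon, aa in CODON_TABLE.items():
    CODONS_OF.setdefault(aa, []).append(codon)


def hamming(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def is_snv_accessible(wild_type, mutant):
    """Return True if some codon pair for the two residues differs by one nucleotide.

    Assumption: the actual wild-type codon is unknown, so every codon of the
    wild-type residue is considered. Residues use one-letter codes.
    """
    if wild_type == mutant:
        return False  # no amino acid change, hence not a missense substitution
    return any(
        hamming(c1, c2) == 1
        for c1 in CODONS_OF[wild_type]
        for c2 in CODONS_OF[mutant]
    )


if __name__ == "__main__":
    for wt, mut in [("A", "V"), ("E", "A"), ("K", "A"), ("W", "A")]:
        label = "SNV" if is_snv_accessible(wt, mut) else "non-SNV"
        print(f"{wt}->{mut}: {label}")
```

Under this criterion, alanine scanning yields a mix of both classes: a substitution such as E to A is reachable by one nucleotide change, whereas K to A or W to A requires at least two.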
Here we briefly outline some of the popular databases of experimentally measured thermodynamic quantities related to protein stability, protein-DNA interactions, and protein-protein binding. These databases are often used by researchers to develop new methods for predicting the stability of proteins and their interactions with other proteins and/or DNA, and to assess the performance of such methods. ProTherm [14,15] is a database of experimentally measured ΔGfolding values for wild-type proteins and their single and multiple mutants. It also provides information about the experimental conditions, such as pH and temperature. ProNIT and ProNAB are databases of experimentally determined protein-nucleic acid ΔGbinding [16,17]. Both databases contain a variety of parameters, including information about the experimental conditions. Similarly, SKEMPI (Structural Kinetics and Energetics of Mutant Protein Interactions) is a database of experimentally measured binding free energy changes [18,19]. It includes data for a wide range of protein-protein complexes, and the mutations are annotated with information about their structural and functional effects.
There are numerous computational methods available for predicting the effect of mutations on protein stability and binding [20,21,22,23]. These methods can be broadly divided into two categories: empirical/machine learning (ML) methods and physics-based methods. Empirical methods are based on a statistical analysis of experimental data and use machine learning algorithms to predict the effect of mutations on ΔGfolding and ΔGbinding. Physics-based methods, on the other hand, use principles of thermodynamics and statistical mechanics to predict the effect of mutations on protein stability. These methods can account for the complex physical interactions that determine protein stability, but they require detailed structural information about the protein and are computationally expensive, which makes them inapplicable to genome-scale investigations. In this article, we deal only with fast methods, i.e., methods that either use adjustable parameters or utilize machine learning.
Several methods have been developed for predicting the effect of mutations on protein stability; they can be broadly grouped into structure-based and sequence-based methods. Structure-based methods use the protein structure to derive features for the wild-type and the mutant protein and then predict the change in the folding free energy due to the mutation. The most popular structure-based methods include FoldX [24], PoPMuSiC [25], mCSM [26], STRUM [27], SDM2 [28], and SAAFEC [29]. The main limitation of these methods is their dependence on the availability of the 3D structure of the protein of interest. Indeed, only a tiny fraction of known proteins have experimentally determined 3D structures, which limits the applicability of these methods. This prompted the development of sequence-based methods, which utilize sequence information alone. The most popular include I-Mutant 2.0 [30], Evolutionary, Amino acid, and Structural Encodings with Multiple Models (EASE-MM) [31], Impact of Non-synonymous mutations on Protein Stability (INPS) [32], BoostDDG [33], and SAAFEC-SEQ [34]. These methods can be applied to genome-scale investigations. Furthermore, it was demonstrated that they outperform some of the structure-based methods despite using sequence information only [34].
The change in protein-protein binding affinity caused by point mutations has also drawn the attention of the research community, and several computational methods for predicting the corresponding binding free energy change have been reported in the literature. These methods can be classified into physics-based and knowledge-based methods. The knowledge-based/empirical methods are generally fast and hence better suited for genome-level screening applications. Methods such as FoldX [24], SAAMBE [35], SAAMBE-3D [36], BindProfX [37], iSEE [38], BeAtMuSiC [39], mCSM-PPI2 [40], and MutaBind2 [41] require the 3D structure of the complex. In addition, there are a couple of sequence-based methods, SAAMBE-SEQ [42] and ProAffiMuSeq [43], which require only the sequence to predict the ΔΔGbinding due to the mutation.
Similarly, computational methods for predicting the effect of mutations on protein-nucleic acid ΔGbinding have also been developed. These methods are fewer than those for predicting changes in the folding free energy or the protein-protein binding free energy, and they all require structural information. The list is quite short and includes FoldX [24], mCSM-NA [44], PremPDI [45], SAMPDI [46], and SAMPDI-3D [47]. It should be noted that, except for SAMPDI-3D, which is a machine learning-based method, all the available methods are either physics-based or empirical. In addition, only SAMPDI-3D [47] allows the prediction of the change in protein-DNA binding affinity caused by mutation of DNA bases.
Predicting the effect of mutations on ΔGfolding and ΔGbinding is essential for protein engineering and for understanding the effect of natural variants, i.e., SNVs. We argue that these two tasks may require slightly different approaches and methods. Protein engineering requires methods capable of correctly predicting the effect of any type of mutation on either ΔGfolding or ΔGbinding, without any restriction on the type of substitution, with the goal of designing more stable proteins or protein-protein and protein-DNA/RNA complexes with better affinity. In contrast, methods for predicting the change of ΔGfolding and ΔGbinding caused by SNVs focus on mutations seen in nature, i.e., in the human population. The goal of this work is to provide an assessment of leading predictors with respect to predicting the change of ΔGfolding and ΔGbinding caused by SNVs versus non-SNVs. It should be mentioned that our investigation sheds light on another aspect of performance assessment, one that differs from previous works focusing on the effect of the enrichment of destabilizing mutations in the existing experimental databases [48]. The less accurate prediction of stabilizing mutations was attributed to such enrichment, which prompted the creation of balanced datasets [49,50]. Other studies on the performance of the leading algorithms suggested that the problem lies in overfitting, in features that are not sufficiently informative for the task [21], and in the quality of the experimental data [51].
3. Discussion
This article aimed to reveal the differences between SNVs and non-SNVs in terms of their distributions in the corresponding databases and the performance of leading algorithms for free energy change prediction. Three types of databases were considered: folding free energy changes, protein-protein binding free energy changes, and protein-DNA binding free energy changes. It should be mentioned that the first two are much larger than the third one, and therefore the observations made for them are statistically more meaningful. The common observation is that SNVs and non-SNVs are almost equally represented in the databases: roughly 50% are SNVs and 50% are non-SNVs. The corresponding free energy changes, ΔGfolding and ΔGbinding, are similar as well, except for the protein-DNA databases, where ΔGbinding of SNVs does not have as many destabilizing cases as non-SNVs do. The main difference between SNVs and non-SNVs in the corresponding databases is the types of mutations. For instance, SNV cases in S2648 are dominated by hydrophobic to hydrophobic and small to small mutations, while non-SNV cases are dominated by large to small and polar to hydrophobic mutations. SNV cases in both SKEMPI-SEQ-2388 and SKEMPI-3D-3775 are dominated by small to small and polar to hydrophobic mutations, while non-SNV cases are dominated by mutations from large to small amino acids. In the S419 and ProNAB-237 datasets, SNVs show more cases of polar to hydrophobic and small to small mutations, while non-SNVs are dominated by large to small and polar to hydrophobic mutations. These differences between SNVs and non-SNVs should be taken into consideration when selecting features for machine learning algorithms for predicting the effects of SNVs. Alternatively, if the goal is to develop a method that predicts the effects of SNVs only, only SNV cases should be used in the training set.
In terms of the performance of the leading predictors of the free energy changes, ΔGfolding and ΔGbinding, we would like to reiterate that our goal is not to compare their absolute performance but rather to see the difference in performance on SNV versus non-SNV cases. Comparisons of their performance have been made in numerous papers by the developers [26,28,30,32,34,36,39,40,41,42,44,45,47,53,54,55,56] as well as in third-party manuscripts [57]. The common observation is that almost all algorithms, as tested on the corresponding datasets, perform worse on SNVs than on non-SNVs. In some cases, the Pearson correlation coefficient (PCC) for SNVs is two times lower than the PCC for non-SNVs. This observation should be considered when assessing the effect of SNVs, both benign and pathogenic, on protein stability and macromolecular interactions, especially since there is a strong linkage between thermodynamics and pathogenicity [58].
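The kind of comparison summarized above, a PCC computed separately for SNV and non-SNV cases, can be sketched as follows. This is a minimal illustration, not the evaluation pipeline used in this work: the record layout `(is_snv, experimental, predicted)`, the function names, and the numbers in the usage example are all hypothetical, and the values are synthetic placeholders rather than data from any database.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pcc_by_group(records):
    """Compute the PCC separately for SNV and non-SNV records.

    records: iterable of (is_snv, experimental_ddg, predicted_ddg) tuples
    (hypothetical layout chosen for this sketch).
    """
    groups = {"SNV": ([], []), "non-SNV": ([], [])}
    for is_snv, experimental, predicted in records:
        key = "SNV" if is_snv else "non-SNV"
        groups[key][0].append(experimental)
        groups[key][1].append(predicted)
    return {key: pearson(exp, pred) for key, (exp, pred) in groups.items()}


if __name__ == "__main__":
    # Synthetic placeholder values for illustration only (kcal/mol assumed).
    toy_records = [
        (True, 0.4, 1.1), (True, 1.2, 0.7), (True, 2.0, 1.5), (True, 0.9, 1.3),
        (False, 0.5, 0.6), (False, 1.8, 1.6), (False, 3.1, 2.7), (False, 2.2, 2.4),
    ]
    for group, pcc in pcc_by_group(toy_records).items():
        print(f"{group}: PCC = {pcc:.2f}")
```

Reporting the two coefficients side by side, rather than a single PCC over the pooled dataset, is what exposes a performance gap between the two classes of mutations.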