1. Introduction.
Design of novel therapeutic agents and the engineering of proteins with desired functionalities are interconnected forefronts of modern biomedical research. On the one hand, understanding the thermodynamics and kinetics of protein-protein interactions is essential for identifying and validating potential drug targets [1]; on the other hand, protein-protein binding energetics plays a pivotal role in rational protein engineering, interface design and modulation of protein-protein interactions [2]. Beyond these applications, the interactions within protein complexes are also crucial for deciphering their roles in cellular processes, disease mechanisms and signal transduction pathways [3]. A significant portion of the information concerning protein-protein binding energetics is encoded within the structural features of protein complexes [4]. The three-dimensional arrangement of proteins within these complexes reveals critical insights into the nature and specificity of their interactions, offering a rich source of information for understanding the energetic basis of their associations [5].
The interplay between protein-protein binding energetics and the structural characteristics of protein-protein interaction (PPI) complexes has been explored with numerous computational approaches. Within the realm of molecular dynamics simulations, a prominent and extensively employed technique is the physical effective energy function (PEEF). PEEF relies on theoretically derived interparticle forces that encompass all atoms within a given structure of a protein complex [6, 7]. The parameters of PEEF are typically obtained from small-molecule crystal and solvation data, as well as ab initio calculations [8, 9]. However, because it is not parameterized on actual protein structures, the PEEF approach has encountered challenges in accurately identifying native protein folds [7]. More specifically, as PEEF is derived from atomic models, it frequently exhibits a rugged energy surface that lacks a smooth descent when approaching the native state [10].
A statistical effective energy function (SEEF) addresses the limitations of PEEF by utilizing parameterization from a database of known protein structures to extract statistics related to pair contacts and surface area burial [11, 12]. This enables the determination of ‘pseudo-potentials’ for protein structures or protein-protein interactions. SEEF offers advantages over PEEF, including a smoother energy landscape and reduced sensitivity to small perturbations [12]. Moreover, its statistical nature allows for the inclusion of all known and potential physical effects, enhancing its robustness [13]. However, SEEF may exhibit a lower discriminatory power due to this very robustness [14].
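As an illustration of how contact statistics can be turned into a pseudo-potential, one widely used form is the inverse Boltzmann relation sketched below. This is a generic textbook expression and not necessarily the form used in the works cited above; the symbols $N_{\mathrm{obs}}$, $N_{\mathrm{ref}}$ and the choice of reference state are assumptions introduced here for illustration.

```latex
% Inverse Boltzmann sketch of a statistical pair pseudo-potential.
% N_obs(i,j): number of contacts between residue (or atom) types i and j
%             observed in the structural database.
% N_ref(i,j): number of contacts expected in a chosen reference state.
\[
  E_{\mathrm{SEEF}}(i,j) \;=\; -\,k_{B}T \,\ln\!\left(\frac{N_{\mathrm{obs}}(i,j)}{N_{\mathrm{ref}}(i,j)}\right)
\]
```

Under this form, pair types that are observed more often than the reference state predicts receive a favorable (negative) pseudo-energy, and the result depends on how the reference state is defined.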
Among other methodologies, a widely employed approach involves combining molecular mechanics (MM) energy with solvation free energy and configurational entropy [15]. The molecular mechanics energy incorporates bond, angle, dihedral, electrostatic, and van der Waals energies in the gas phase. Conformational entropy is typically computed from normal-mode analysis based on a set of conformational snapshots obtained from molecular dynamics simulations. Solvation free energy, on the other hand, is determined by calculating the change in free energy associated with transferring a molecule from an ideal gas to a solvent at a specific pressure and temperature [16]. This process considers alterations in solvent-accessible surface area and electrostatic interactions between the solute and solvent. The electrostatic part can be determined either with the Generalized-Born (GB) model [17] or by solving the finite-difference Poisson-Boltzmann (PB) equation [18], which leads to the MM/GBSA and MM/PBSA approaches, respectively. Although both methods share the entropic, solvent-accessible surface area, and molecular mechanics components, their treatment of electrostatics differs and depends on the charge model, force field, radius parameter in the continuum solvent model, and solvent dielectric constant. Generally, MM/PBSA outperforms MM/GBSA in predicting protein-protein binding free energies [15]. However, it is crucial to note that MM/PBSA’s sensitivity to the dielectric constant of the solute necessitates careful calibration based on the charge distribution of the binding interface in PPI complexes [15].
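The decomposition described above can be summarized as the following sketch. The grouping of terms follows the text; the linear surface-area form of the non-polar term (with fitted constants $\gamma$ and $b$) is a common convention assumed here, not something stated in the source.

```latex
% Binding free energy as the difference between the complex (AB)
% and the free partners (A, B):
\[
  \Delta G_{\mathrm{bind}} \;=\; G_{AB} - G_{A} - G_{B}
  \;\approx\; \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{solv}} - T\,\Delta S
\]
% Gas-phase molecular mechanics energy:
\[
  \Delta E_{\mathrm{MM}} \;=\; \Delta E_{\mathrm{bond}} + \Delta E_{\mathrm{angle}}
  + \Delta E_{\mathrm{dihedral}} + \Delta E_{\mathrm{elec}} + \Delta E_{\mathrm{vdW}}
\]
% Solvation term: the polar part comes from PB (MM/PBSA) or GB (MM/GBSA);
% the non-polar part is assumed here to scale with the change in
% solvent-accessible surface area (SASA).
\[
  \Delta G_{\mathrm{solv}} \;=\; \Delta G_{\mathrm{polar}}^{\mathrm{PB\ or\ GB}}
  + \gamma\,\Delta \mathrm{SASA} + b
\]
```

The two approaches differ only in how $\Delta G_{\mathrm{polar}}$ is evaluated, which is why their sensitivity to charge model, radii and dielectric constants can differ even though the remaining terms are shared.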
The utilization of artificial intelligence to predict protein-protein binding affinities is a recent development. Many approaches focus on determining the changes in binding free energy resulting from one or multiple mutations in PPI complexes. For instance, mmCSM-PPI [19], Geo-PPI [20], TopNetTree [21], and PPI-affinity [22] employ algorithms such as extra-trees, gradient-boosting trees, and support vector machines to predict changes in binding affinity upon mutation, achieving Pearson’s correlation coefficients (r) of 0.75, 0.52, 0.79, and 0.78, respectively, between predicted and experimental data (ΔΔG) for the SKEMPI 2.0 database [23]. In the study conducted by Romero-Molina et al., the PPI-affinity method demonstrated general applicability for predicting the binding free energy of diverse PPI complexes [22]. The PPI-affinity method achieved an r-value of 0.62 between experimental and predicted binding free energies for a training dataset comprising 833 PPI complexes (Test set 1). However, the r-value dropped to 0.50 when evaluated on a separate hold-out test dataset consisting of 90 PPI complexes (Test set 2) [22]. Furthermore, the performance of PPI-affinity was compared with that of previously available methods for predicting protein-protein binding affinity on both Test set 1 and Test set 2. PRODIGY, another method which employs an SEEF approach, exhibited an r-value of 0.74 on Test set 1 (on which it was trained), but its performance declined to an r-value of 0.31 on Test set 2, indicating potential overfitting towards the benchmark dataset [24]. Additionally, DFIRE [25], CP_PIE [26], and ISLAND [27] displayed r-values of 0.10, -0.10, and 0.27, respectively, on the hold-out dataset (Test set 2). It is noteworthy that all of these available methods utilize an SEEF in predicting protein-protein binding affinity.
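The r-values quoted above are Pearson correlation coefficients between predicted and experimental values. A minimal sketch of that calculation follows; the function name `pearson_r` and the numbers in the example are hypothetical and are not taken from any of the cited benchmarks.

```python
import math


def pearson_r(predicted, experimental):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_e)


# Hypothetical binding free energies (kcal/mol), for illustration only.
experimental_dg = [-12.1, -9.8, -7.5, -10.4, -8.2]
predicted_dg = [-11.0, -9.1, -8.3, -10.9, -7.6]
r = pearson_r(predicted_dg, experimental_dg)
print(f"r = {r:.2f}")
```

Because r measures only linear association, a method can reach a high r while still being systematically offset from the experimental values, so r is usually read alongside an error measure.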
On the other hand, EnCPdock [28], trained on a dataset comprising 3200 PPI complexes with binding free energies calculated using FoldX [29], employed a support vector regression approach. Cross-validation of EnCPdock yielded a maximum correlation (r_max) value of 0.745 between the target function (ΔG_FoldX_norm) and the predicted output (ΔG_EnCPdock_norm), with a corresponding maximum balanced accuracy (BACC) score of 0.833. EnCPdock’s performance was also evaluated on two independent datasets, namely the Affinity benchmark dataset and the ‘SKEMPI + PROXiMATE–merged’ dataset, comprising 106 and 236 binary complexes, respectively. It achieved correlation coefficients of 0.45 and 0.52, respectively, between the predicted ΔG_EnCPdock_norm and the actual binding free energies for these datasets. Furthermore, EnCPdock offers more than just an AI-predicted ΔG_binding; it also provides essential information such as shape and electrostatic complementarities (Sc, EC), surface area estimates, and other high-level structural descriptors used as input feature vectors. Additionally, EnCPdock delivers a binary PPI complex mapping in the Complementarity Plot (CP) (https://en.wikipedia.org/wiki/Complementarity_plot) [30] and generates interactive molecular graphics of the atomic contact network at the interface, along with a contact map for further analysis.
This comprehensive platform facilitates the direct visualization and analysis of specific native interactions (contacts) contributing to binding, offering insights into their stability or transience across a library of mutants. EnCPdock further furnishes individual feature trends and relative probability estimates (Prfmax) of the obtained feature-scores, providing a valuable tool for targeted protein interface design and aiding researchers in identifying structural defects, irregularities, and sub-optimality for subsequent redesign. With this wide array of features and applications, EnCPdock is an online tool that should benefit structural biologists and researchers across related disciplines, both in studying protein interactions and in designing dockable peptides.
2. Materials.
EnCPdock was developed utilizing several external programs for various tasks. The ‘sc’ program, a part of the CCP4 package [31], was employed to quantify the shape complementarity at protein-protein interfaces, measured by directly implementing the original shape correlation statistic (Sc) formulated by Lawrence and Colman [32]. Sc was designed based on the cumulative alignment of the nearest neighboring dot surface points (unit normal vectors) of the interacting molecular (Connolly) surfaces [33] at protein-protein interfaces (binary complexes). On the other hand, the electrostatic complementarity (EC) function measures the complementarity of surface electrostatic potential at the protein-protein interacting surfaces, arising from the distribution of atomic partial charges across the whole molecular complex. For this purpose, the same molecular (Connolly) surfaces were constructed utilizing EDTSurf [34] (at 20 dots / Å²) and the surface electrostatic potentials on these dot surface points were computed by iteratively solving the Poisson-Boltzmann equation by the finite difference method of DelPhi [35], implementing its smoothed Gaussian dielectric function [36]. EC was then computed as the negative correlation of appropriately chosen troughs of surface electrostatic potential values, as detailed in its original and adapted formulations [30, 37].
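To make the "negative correlation" definition of EC concrete, the sketch below assumes that each interface surface carries two potentials at every dot surface point: one generated by the atoms of the molecule that owns the surface and one generated by the atoms of its partner. The function names, the use of NumPy, and the averaging over the two surfaces are illustrative assumptions; the exact formulation is the one given in the cited references.

```python
import numpy as np


def electrostatic_complementarity(phi_own, phi_partner):
    """Negative Pearson correlation of two potentials sampled on the same surface points."""
    phi_own = np.asarray(phi_own, dtype=float)
    phi_partner = np.asarray(phi_partner, dtype=float)
    if phi_own.ndim != 1 or phi_own.shape != phi_partner.shape or phi_own.size < 2:
        raise ValueError("expected two 1-D arrays of equal length (at least 2 points)")
    if phi_own.std() == 0.0 or phi_partner.std() == 0.0:
        raise ValueError("correlation is undefined for a constant potential")
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the cross term.
    return -float(np.corrcoef(phi_own, phi_partner)[0, 1])


def interface_ec(surface_a, surface_b):
    """Average EC over both interface surfaces (an assumption of this sketch).

    Each argument is a (phi_own, phi_partner) pair for one molecule's surface.
    """
    ec_a = electrostatic_complementarity(*surface_a)
    ec_b = electrostatic_complementarity(*surface_b)
    return 0.5 * (ec_a + ec_b)
```

With this sign convention, potentials that are anti-correlated across the interface (positive patches facing negative patches) give EC values approaching +1, while like charges facing each other push EC towards -1.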
To map a PPI complex based on its {Sc, EC} values (treated as an ordered pair), EnCPdock utilized the docking scoring version (CP_dock [38]) of the two-dimensional Complementarity Plot (CP) [30]. The CP serves as a visual aid to validate the structural accuracy of atomic models, applicable to both folded globular proteins [30, 39] and protein-protein interfaces [28, 38, 40]. The CP_dock version of the plot represents the shape complementarity (Sc) and electrostatic complementarity (EC) attained at the interface of the protein-protein complex on the X- and Y-axes, respectively. For training, EnCPdock implemented a support vector regression machine with a radial basis function kernel, distributed as SVM^light [41]. The binding free energy (ΔG_binding) of the PPI complexes in the training and test datasets was determined using the standalone version (v.4) of FoldX (http://foldxsuite.crg.eu/) [42], which follows a "fragment-based strategy" utilizing fragment libraries similar to the "fragment assembly simulated annealing" technique for protein structure prediction [14, 43]. Atoms that underwent a net (non-zero) change in solvent-accessible surface area (ASA) upon binding were identified as atoms at the protein-protein interface, wherein the ΔASA was calculated by NACCESS [44] with a probe radius of 1.4 Å, representing the hydrodynamic radius of water.
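The interface-atom criterion described above (a non-zero change in ASA upon binding) can be sketched as follows. The sketch assumes that per-atom ASA values for the free and bound states have already been computed (for example, parsed from NACCESS output, which is not shown); the function name, the atom-identifier tuples and the optional tolerance `tol` are hypothetical.

```python
def interface_atoms(asa_free, asa_bound, tol=0.0):
    """Return {atom_id: delta_asa} for atoms whose ASA changes upon binding.

    asa_free  : per-atom ASA (in square angstroms) of the isolated partner
    asa_bound : per-atom ASA of the same atoms within the complex
    tol       : changes with absolute value <= tol are treated as zero
                (hypothetical parameter, useful against rounding noise)
    """
    interface = {}
    for atom_id, free_value in asa_free.items():
        # Raises KeyError if an atom is missing from the bound-state table.
        delta = free_value - asa_bound[atom_id]
        if abs(delta) > tol:
            interface[atom_id] = delta
    return interface


# Hypothetical atoms keyed by (chain, residue number, atom name).
asa_free = {("A", 45, "CB"): 32.5, ("A", 45, "CA"): 4.1, ("A", 90, "O"): 12.0}
asa_bound = {("A", 45, "CB"): 3.2, ("A", 45, "CA"): 4.1, ("A", 90, "O"): 12.0}
buried = interface_atoms(asa_free, asa_bound)
print(sorted(buried))
```

Only the atom whose accessibility differs between the two states is reported; atoms with identical free and bound ASA are treated as lying outside the interface.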