2.1. Protein Sample Collection
The datasets consist of two types of human proteins: human cytoplasmic proteins with high abundance and extracellular proteins. First, we collected cytoplasmic proteins with the highest abundance level from the PaxDb database, a collection of experimental data on protein abundance.[
17] The cytoplasmic proteins that were also tagged extracellular keywords (e.g., secreted, extracellular matrix, and extracellular space) in Uniprot were eliminated. Then, proteins in extracellular environments determined with experimental assay were collected (GO ID: 5615).[
18] Finally, 331 human extracellular proteins and 337 HAC proteins within the sequence length range of 100 to 700 were collected for analysis.
The 3D structures of a total of 668 proteins were collected through the Alphafold ver2.0 (Alphafold2) (
https://alphafold.ebi.ac.uk/) protein structure prediction model.[
15,
19] Alphafold2 3D models provide entire protein structures, allowing for comprehensive surface analysis, unlike partial structures often found in experimental Protein Data Bank (PDB) files from X-ray crystallography. Alphafold2 is known to be the top-ranked prediction model with a median global distance test score of 92.4 across all targets and 87.0 on the challenging free modeling category in the 14
th CASP assessment (
https://predictioncenter.org/casp14/zscores_final.cgi). Additionally, in most cases, Alphafold2’s structural prediction accuracy has reached experimental accuracy.[
15]
Even though the overall predictability of Alphafold2 is exceptional, not all predicted structures are suitable for the analysis. Every residue from Alphafold2 3D protein structure is given with per-residue metric, which reflects the structural model confidence called predicted local distance difference test (pLDDT), scaling from 0 to 100. The pLDDT evaluates how well the predicted model agrees with experimental data using local distance difference test Cα.[
20] pLDDT > 90 is considered a high-accuracy cut-off, and pLDDT> 70 can be regarded as generally correct backbone prediction.[
21] When the pLDDT is lower than 50, the predicted region is expected to be intrinsically disordered.[
22] However, a low pLDDT score in Alphafold2 results from high residue flexibility and dynamic structure rather than ‘low confidence.’[
23] Also, since disordered regions of proteins are involved in molecular recognition and hydrophobic interactions, it is essential to include the regions for the analysis.[
24] Considering the potential interpretability difficulty from intrinsically disordered proteins, we set our cut-off value as average pLDDT>50 for a whole protein structure. Finally, we ensured that over 80% of extracellular and HAC proteins had average pLDDT values over 70 (
Figure 2).
2.2. Calculation of Surface Descriptors
Previous studies have introduced several definitions of protein surfaces, each with different characteristics. Among them, we adopted solvent-accessible Surface (SAS) and solvent-excluded surface (SES) for calculating other descriptors (
Figure 1).[
25] The SAS was calculated by rolling probe spheres that have an equivalent size with water molecules. We used SAS for the residue-based analysis: we assumed that a specific residue in a protein could have a maximum SAS when its neighboring amino acids are Glycines. (i.e., having Gly-residue-Gly structure) When the proportion of an actual SAS for a residue to a maximum SAS was higher than or equal to 30%, the residue was defined as surface residue. Another protein surface used in the analysis is SES, also called the Connolly surface.[
26] The surface moves inward from the SAS by a distance identical to the probe sphere radius (
Figure 1). Lewis et al. uncovered that this continuous and functional surface is particularly useful in calculating the protein surface roughness. Then, protein surface descriptors representing various physicochemical, structural, and geometrical descriptors were calculated (
Table 1) based on the two surface types. All the descriptors were computed using Python 3.9.12.
Surface hydrophobicity, charge, secondary structures, and overall morphology of proteins are critical parameters for protein structures. The normalized consensus hydrophobicity scale quantitatively measured the average protein surface hydrophobicity.[
27] Surface charge-related descriptors were collected by calculating the fraction of SAS of negatively charged and positively charged amino acids under physiological conditions (pH=7). Each surface amino acid contributing to the secondary structure was directly extracted by Pymol (
http://www.pymol.org) to calculate the surface proportion of each secondary structure. The surface exposure degree was defined by the SAS divided by the volume of protein.
The B-factor, which is also called the Debye-Waller factor, indicates the thermal motion-induced attenuation of X-ray scattering or coherent neutron scattering.[
28,
29] Equation (1) defines B-factor:
where
(
) denotes the mean displacement of a scattering center. The B-factor is used to interpret properties such as thermostability, flexibility, internal motion, and binding of proteins.[
30,
31,
32,
33,
34] In Alphafold2 models, the B-factor columns are replaced by pLDDT values, which can provide insights into structural flexibility.[
23] We converted pLDDT values into pseudo B-factors since pLDDT values and original B-factor show a reverse relationship. The pLDDT values were first converted into root-mean-square deviation (RMSD) using the following empirical formula (Equation (2)):
where
denotes error estimates. pLDDT values were transformed into the scale of 0-1 from 0-100.[
35,
36,
37] Then, the converted pseudo B-factor is expressed as Equation (3) after substituting the converted error estimates into eq 1, considering the root-mean-square positional variation in three dimensions.
The converted pseudo-B-factors were calculated for each residue in proteins. However, in the case of X-ray analysis, low resolution leads to high B-factors around 100-200, and such high values of B-factors are not recommended to make specific conclusions.[
38] Therefore, only surface residues with RMSD smaller than or equal to 1.5 (almost equivalent to
B ≤ 60) were included for analyzing surface B-factors. Finally, B-factors were normalized as Equation (4) since the non-normalized B-factor does not represent an absolute quantity, therefore, cannot be used to compare different protein structures.[
39]
<B> denotes the average B-factor in the whole protein structure, and σ implies the standard deviation. Then, the mean value of normalized surface B-factors in a protein was used to characterize the protein surface.
Surface roughness, which can be quantitatively characterized by the fractal dimension (
FD), was calculated to identify the surface structural irregularity (Equation (5)).[
26]
and
represent the molecular surface area and rolling probe radius, respectively.
FD falls within the range of 2 to 3, having the smoothest surface on 2 and having the roughest surface on 3. As for the calculation of
, we calculated SES using 3V calculator (
http://3vee.molmovdb.org).[
40] Then, Equation (5) was transformed into Equation (6) for the convenience of calculation.
i refers to a probe radius starting from 1.2, in the range of 1.0 to 3.6, with the interval of 0.2 (1.0,1.2,1.4, 1.6 … 3.6,
N (Number of sets)=13).
i-1 refers to the previous step of
i (
i-1 starts from 1.0).
indicates the log value of solvent excluded surface area under the probe radius
i. The range of probe radius is suitable for the analysis since the probe sizes are sensitive to specific interactions between residues, reflecting the size of water molecules and side chains.[
26] Finally, the mean value of all the calculated
Di represents the
FD.
2.3. Application of Machine Learning
The logistic Regression (LR) model, a regression model for binary classification problems, shows its chief advantage in providing high model interpretability. An odds ratio of each independent variable enables quantitatively evaluating its contribution to dependent variables. Surface descriptors were given as independent continuous variables, and <HAC:1, Extracellular:0> tags were provided as dependent dichotomous variables in the models. Then, Equation (7) represents the probability of being HAC proteins under the given independent variables[
41]:
P, xi, βi denote the probability of being a HAC protein, a surface descriptor, and an accompanying beta coefficient. LR uses a method of maximum likelihood to estimate
and the odds ratio corresponds to
exp[
βi]. Then, logistic transformation, which converts the non-linear relationship into the original linear regression equation, is applied as eq 8.
A positive indicates that the increase in xi leads to the stochastic increase in the probability of being HAC proteins. Conversely, A negative means that an increase in xi results in the stochastic decrease for the probability of being HAC proteins.
As a parametric model, LR requires several statistical assumptions to perform well.[
41] Thus, several data preprocessing steps were conducted, including checking the multicollinearity of surface descriptors, deletion of strongly influential outliers, and data scaling to meet the assumptions and enhance the model performance. Pearson correlation (PC) analysis, a statistical test that measures the linear association between two variables, was conducted to limit the multicollinearity problem. Also, Cook’s distance from the statsmodels module in Python was calculated for leverage and residual values analysis. Conclusively, 1.03% of proteins turned out to be highly influential and outliers simultaneously and were eliminated from the dataset. Finally, the surface descriptors were standardized with StandardScaler function in Python sci-kit learn library for data scaling.
Upon constructing the LR model, several popular supervised learning algorithms for classifications: K-Nearest Neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM), were used to compare the performance of different models. All the algorithms were performed through Scikit-learn Package in Python 3.9.12. The hyperparameters for each algorithm were optimized using GridSearch cross-validation (CV), where every parameter combination is tested to evaluate ML models. 5-fold cross-validation was used to avoid overfitting to the test set. Before constructing machine learning models, datasets were randomly divided into a training set (80%) and a test set (20%), maintaining the original ratio of the target class. Then, the performance of different models was assessed by the predictive indicators: the classification accuracy and the area under the curve of receiver operating characteristic (AUC-ROC) curve. We repeated five times randomly splitting training and test sets to avoid sampling bias and overfitting, then reported the mean accuracy of each model. We selected the final ML model, LR, for the feature importance analysis considering high accuracy and model interpretability. Finally, each descriptor's significance and importance were explained with statistical analysis.