3.1. Performance of MQSSMs at the Calibration Phase
This section addresses the reduction of the initial pool of 210 models. In the preliminary phase of calibration, the 210 models were evaluated with three datasets: TR StarPep, TS StarPep, and EX StarPep. Because these datasets are derived predominantly from StarPepDB, model performance was relatively uniform across them, as depicted in Figure 1. Such an outcome is expected given the considerable overlap between the sequences in these datasets and those in the "Query" set.
During the next phase of calibration, six datasets not associated with StarPepDB were incorporated (Figure 1). Predictably, the models' effectiveness decreased in this phase compared to the initial round. A key observation at this stage was that the models performed better on datasets whose negative class consisted of randomly generated sequences or sequences not yet tested experimentally. In contrast, performance metrics dropped on datasets containing experimentally validated negative sequences. This behaviour is likely linked to the fact that many of these negative sequences closely resemble the positive ones, so the alignment-based methods struggled to differentiate the distinctive attributes of each class.
This problem was especially pronounced in datasets such as ENNAVIA-A and Thakur, where experimentally tested sequences were used as the negative class. Conversely, with ENNAVIA-B, which contains the same positive sequences, the models identified non-antiviral sequences more effectively. It is important to emphasize that experimentally validated negative sequences are scarce compared with positive sequences. Hence, the predominant modelling challenge lay in improving the detection rate (recall) of positive sequences overall.
Clear trends in the MQSSM parameters emerged from the calibration stage. A notable observation was that models with larger scaffolds performed better, an improvement attributed to a more detailed characterization of the AVP chemical space. Scaffolds such as Md4, Md5, SG4, SG5, SL4, and SL5 generated the most model variants (Table SI1.1). Regarding alignment, global alignment proved more effective at lower sequence identity thresholds, whereas local alignment was more successful at higher thresholds. For simpler scaffolds, global alignment consistently outperformed local alignment regardless of the identity percentage.
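To make the decision rule behind these parameter trends concrete, the sketch below implements a simplified multi-query similarity search: a peptide is labelled AVP-like when its percent identity to at least one scaffold sequence reaches a cut-off, using either global or local pairwise alignment. It assumes Biopython's PairwiseAligner; the scaffold sequences, gap penalties, and the 90% threshold are illustrative placeholders rather than the published settings.

```python
# Minimal sketch of an MQSSM-style decision rule: a peptide is labelled
# AVP-like when its percent identity to at least one scaffold ("Query")
# sequence reaches a chosen threshold. All concrete values are placeholders.
from Bio import Align
from Bio.Align import substitution_matrices


def make_aligner(mode: str) -> Align.PairwiseAligner:
    """Configure a pairwise aligner in 'global' or 'local' mode."""
    aligner = Align.PairwiseAligner()
    aligner.mode = mode                                  # "global" or "local"
    aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
    aligner.open_gap_score = -10.0                       # assumed gap penalties
    aligner.extend_gap_score = -0.5
    return aligner


def percent_identity(aligner: Align.PairwiseAligner, a: str, b: str) -> float:
    """Percent identity over the aligned (non-gap) columns of the best alignment."""
    best = aligner.align(a, b)[0]
    matches, columns = 0, 0
    for (a_start, a_end), (b_start, b_end) in zip(*best.aligned):
        for i, j in zip(range(a_start, a_end), range(b_start, b_end)):
            columns += 1
            matches += a[i] == b[j]
    return 100.0 * matches / columns if columns else 0.0


def is_avp_like(query: str, scaffold: list[str], mode: str = "global",
                identity_cutoff: float = 90.0) -> bool:
    """Multi-query rule: one sufficiently similar scaffold sequence is enough."""
    aligner = make_aligner(mode)
    return any(percent_identity(aligner, query, ref) >= identity_cutoff
               for ref in scaffold)


# Toy usage with placeholder sequences.
scaffold = ["GLFDIVKKVVGALGSL", "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK"]
print(is_avp_like("GLFDIVKKVVGALGSL", scaffold, mode="global"))
```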
3.2. Performance of MQSSMs at the Validation Phase
In the validation stage, we assessed the models against datasets containing sequences targeting specific viruses such as SARS-CoV, using ENNAVIA-C, ENNAVIA-D, and Imb_CoV for this purpose. As shown in Figure 1, the base models performed poorly on ENNAVIA-C and Imb_CoV. This result was expected because the models' reference sequences date from 2019, and it highlights how strongly performance depends on the representativeness of the "Query" dataset. The models performed better on the ENNAVIA-D dataset, which contains random negative sequences, in line with their behaviour on similar datasets.
During model selection, the 32 models chosen in the first round of validation were tested against the Expanded dataset. From these, 12 top performers were identified using a multi-variable Friedman ranking. These models, labelled M1 to M12, comprised six based on global alignment and six on local alignment; their parameters are summarized in Table SI1.2. From these 12 models, the top three performers were singled out, focusing subsequent fine-tuning on enhancing the recovery of positive (AVP) sequences.
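As an illustration of how such a multi-variable Friedman ranking can be computed, the sketch below ranks a set of hypothetical models within every dataset/metric combination and orders them by mean rank. The model names, dataset and metric counts, and random scores are placeholders; SciPy's friedmanchisquare is used only to check whether the rank differences are statistically meaningful.

```python
# Illustrative multi-metric Friedman-style ranking for model shortlisting.
import numpy as np
from scipy.stats import rankdata, friedmanchisquare

rng = np.random.default_rng(0)
models = [f"M{i}" for i in range(1, 13)]           # hypothetical candidates
n_datasets, n_metrics = 10, 5                      # e.g. ACC, SP, SN, MCC, F1
scores = rng.uniform(0.5, 1.0, size=(n_datasets, n_metrics, len(models)))

# Rank models within every dataset/metric slice (rank 1 = best score).
ranks = np.empty_like(scores)
for d in range(n_datasets):
    for m in range(n_metrics):
        ranks[d, m] = rankdata(-scores[d, m])

mean_rank = ranks.reshape(-1, len(models)).mean(axis=0)
for idx in np.argsort(mean_rank):
    print(f"{models[idx]}: mean rank {mean_rank[idx]:.2f}")

# Friedman test over all (dataset, metric) blocks: are the models'
# rank differences larger than expected by chance?
blocks = scores.reshape(-1, len(models))
stat, p = friedmanchisquare(*[blocks[:, j] for j in range(len(models))])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3g}")
```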
3.3. Improving MQSSMs Performance by Fusing Scaffolds
Continuing from the focused analysis of the top models for positive-sequence recovery, an initial enhancement combined the scaffolds of models M3, M7, and M12 (Md4, SL5, SG5) into a single consolidated scaffold. After removal of duplicate sequences, the merged scaffold contained 3206 unique sequences. Testing this scaffold indicated a slight improvement for models using global alignment with a 90% identity threshold, and a new model, labelled M13, was built with these parameters. The improvement in M13 was primarily due to the expanded representation of the sequence space afforded by the larger number of sequences.
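The scaffold-fusion step itself is a straightforward pooling and deduplication of reference sequences. The sketch below illustrates it under the assumption that each scaffold is stored as a FASTA file; the file names are hypothetical.

```python
# Sketch of the scaffold-fusion step: the reference sets of three models are
# pooled and exact duplicates removed before the merged scaffold is reused
# with the same alignment parameters.
def read_fasta(path: str) -> list[str]:
    """Return the sequences (not headers) from a FASTA file."""
    sequences, current = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("".join(current))
    return sequences


def fuse_scaffolds(paths: list[str]) -> list[str]:
    """Union of all scaffold sequences, keeping first-seen order."""
    seen, fused = set(), []
    for path in paths:
        for seq in read_fasta(path):
            if seq not in seen:
                seen.add(seq)
                fused.append(seq)
    return fused


# Hypothetical usage for an M13-style merged scaffold:
# fused = fuse_scaffolds(["M3_scaffold.fasta", "M7_scaffold.fasta", "M12_scaffold.fasta"])
# print(len(fused))   # the text reports 3206 unique sequences for M13
```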
Although these modifications provided insights into the nuances of the MQSSMs, their overall efficacy remained unsatisfactory. The subsequent strategy leveraged the performance information gathered during the calibration phase. As shown in Figure 1, the base models struggled with datasets such as Thakur, ENNAVIA-A, AMPfun, and AVPiden, most likely because the diverse sequences in these datasets were inadequately represented in the MQSS scaffolds. Furthermore, the presence of many experimentally validated negative sequences in some of these datasets increased the difficulty of making accurate predictions.
3.4. Improving MQSSMs Performance by Enriching the Best Scaffolds
To tackle these issues, a Half-Space Proximal Network (HSPN) was constructed by pooling the positive sequences from the challenging datasets (Thakur, ENNAVIA-A, AMPfun, and AVPiden), 2403 sequences in total. This produced eight scaffolds, of which the top two were chosen to enrich the existing scaffolds, leading to the new, improved models M3+, M7+, M12+, and M13+; the "+" in their names denotes the enrichment of their reference scaffolds. After enrichment, these scaffolds contained 3155, 3437, 3472, and 3606 sequences, respectively. To incorporate scaffolds not present in StarPepDB, we built the E1 and E2 models from an external scaffold, comprising 1517 and 1261 sequences, respectively. This increased the analysis pool to 10 models: the six newly created models plus the pre-existing M3, M7, M12, and M13. Complete details of these models are given in Table SI1.2.
Following their development, the 10 models were tested extensively across 15 datasets: the 14 datasets from our workflow plus the Expanded dataset (File SI3). A Friedman ranking was used to halve the number of models, evaluating them on accuracy (ACC), specificity (SP), sensitivity (SN), Matthews correlation coefficient (MCC), and F1 score. The ranking identified M3+, M13+, M7, M12, and E1 as the most effective models, highlighted in grey in Table 3.
Notably, while models with a greater number of reference sequences, such as M3+ and M13+, ranked high, models with fewer sequences, such as E1, M7, and M12, performed comparably well. This suggests that the effectiveness of the references hinges more on their diversity and representational range than on their quantity.
3.5. Benchmarking the Best MQSSMs against State-of-the-Art Predictors
The top five models were then benchmarked against existing predictors from the literature, providing a comparative assessment of their performance relative to other available tools. To ensure unbiased comparisons, any sequences shared between the scaffolds of models M3+, M7, M12, M13+, and E1 and the Reduced dataset were eliminated, reducing the number of positive sequences to 116 while leaving the number of negative sequences unchanged. Note that many negative sequences in the Reduced dataset are part of the training sets of the external predictors but were not removed, a decision that placed our models at a significant disadvantage in the evaluation. We tested 14 external predictors (File SI4) using ACC, SP, SN, MCC, FPR, and F1 score as evaluation metrics, with MCC, which is robust to class imbalance, as the primary metric for our analysis.
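For reference, the sketch below computes the benchmarking metrics directly from confusion-matrix counts and illustrates, with placeholder counts, why ACC and SP can remain high under strong class imbalance while SN and MCC collapse.

```python
# Benchmarking metrics written out from raw confusion-matrix counts.
# The counts in the example are placeholders, not reported results.
from math import sqrt


def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if tp + fn else 0.0          # sensitivity / recall
    sp = tn / (tn + fp) if tn + fp else 0.0          # specificity
    fpr = 1.0 - sp                                   # false-positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sn / (precision + sn)) if precision + sn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn - fp * fn) / denom) if denom else 0.0
    return {"ACC": acc, "SP": sp, "SN": sn, "FPR": fpr, "F1": f1, "MCC": mcc}


# Few positives (116) against a large negative majority: ACC and SP stay
# high even when many positives are missed, while SN and MCC drop sharply.
print(confusion_metrics(tp=30, fn=86, tn=2000, fp=20))
```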
The alteration of the Reduced dataset resulted in a marked decrease in MQSSM performance, especially in the SN and MCC values. ACC and SP, however, remained relatively stable, owing to the substantial imbalance between positive and negative cases. The drop in sensitivity underscores a consistent shortfall in correctly identifying positive sequences, and this inability to recall true positives strongly affected the MCC; in the M13+ model, for example, MCC fell from 0.731 to 0.214, as detailed in Table 4. The F1 score, which depends on both recall and precision, declined correspondingly.
Despite the less-than-ideal results, the MQSSMs still surpassed the external predictors in overall performance. Figure 2 summarizes the performance of the external predictors and highlights two distinct trends. Some models are highly effective at identifying most positive sequences, yielding a high SN at the cost of a higher false-positive rate. Others excel at correctly classifying the negative sequences but misclassify many positive ones, a trait shared by the MQSSMs developed here. Most deep learning models fall into the former, high-sensitivity category, whereas traditional ML models are more likely to fall into the latter, specificity-oriented category.
The analysis revealed that no single predictor excelled in all evaluation categories, supporting Garcia-Jacas et al.'s conclusion that DL methods may not be the most effective for AVP prediction [36]. Extending this observation, none of the ML models tested demonstrated satisfactory performance, suggesting a need for significant improvements. A central issue is the quality and representativeness of the training data: most positive sequences used in training exhibit high similarity, often up to 90%, and the experimentally validated negative sequences often mirror their positive counterparts. The challenge therefore lies not in the complexity of the models' architectures but in the availability and diversity of the data, and gathering and effectively using comprehensive data for such models remains an ongoing challenge.
Evaluating state-of-the-art predictors also revealed a common accessibility problem. Many predictors were difficult to assess for various reasons. A significant issue was the poor construction of several web servers, leading to operational failures or frequent malfunctions; this affected even relatively new servers less than two years old. In addition, a few research repositories lacked comprehensive instructions, complicating their use. This difficulty in accessing and implementing these tools echoes concerns previously raised about the availability of source code [46]. Nonetheless, a comprehensive summary of the currently available prediction tools is provided in the supplementary information (Table SI1.4).
In contrast to these challenges, the MQSSMs distinguish themselves by their accessibility: they are available through the StarPep toolbox standalone software, whose user-friendly interface makes the models approachable and easy to use.
Despite the need for improvements to address their shortcomings, the MQSSMs retain a performance edge over traditional ML models, as substantiated by a Friedman ranking based on MCC, ACC, SP, SN, and F1 score. Importantly, the MQSSMs also require considerably fewer computational resources and are not constrained by sequence length limitations, which are significant practical benefits.
When choosing a prediction model, it is essential to align the choice with the specific needs of the researcher. Some may prioritize identifying a larger number of potential AVPs, while others might prefer a smaller, more accurate set that reduces false positives. Given the resource-intensive nature of experimental procedures, the latter approach is often more practical for synthesizing potential AVPs, as it balances resource use against the likelihood of accurate predictions.