Abstract
Background Ubiquitination plays an important role in protein post-translational processes and is involved in a number of regulatory functions, including proteasome degradation, DNA repair, transcription, signal transduction, endocytosis, and sorting. Because identifying ubiquitination sites is critical to understanding the mechanism of ubiquitination, various experimental and machine learning (ML) methods have been applied to this task. Improving the accuracy of ubiquitination site prediction remains important but challenging. In this research, we explore whether incorporating grid search into the training process can improve the prediction performance of machine learning.
Method We developed grid search procedures for six widely used machine learning methods, namely naive Bayes (NB), logistic regression (LR), decision tree (DT), support vector machine (SVM), LASSO, and k-nearest neighbors (KNN), and applied them to ubiquitination site prediction using the six previously developed PCP datasets. For each ML method, we defined a set of candidate values for each of its tunable hyperparameters. These sets of values are combined to form a large grid of hyperparameter settings, and every setting in the grid is evaluated during grid search. We integrated 5-fold cross-validation into grid search to train and test ML models, and applied an additional independent validation procedure by performing an 80-20 sample split before training. We evaluated the six methods by comparing them side by side on each of the six datasets. We also compared the grid search results with previously published results obtained without grid search. To optimize prediction performance, we trained 1.1 million ML models in total through grid search.
Results We compared the overall prediction performance of the six methods, as well as their performance on balanced vs. imbalanced data and on large-scale vs. small-scale data.
From the perspective of the datasets, we find that overall performance on every PCP dataset improved significantly in this study compared with the previous study, with the percentage increase in average AUC ranging from 7.9% (PCP-4) to 17.0% (PCP-1) across the six datasets. From the perspective of the methods, we find that three out of four methods benefit significantly from grid search compared with their previously published non-grid-search results, with maximum AUC improvements as high as 47% (LASSO on PCP-5), 43.3% (NB on PCP-1), and 33.7% (SVM on PCP-6). SVM ranks first overall, followed by KNN, based on their average AUCs across all datasets. However, these two methods also rank as the top two in the total running time required for grid search (KNN 76 days and SVM 15 days). We also find that SVM, KNN, and DT tend to handle small-scale and imbalanced datasets better, while LR and LASSO perform well on large-scale and balanced datasets. NB is more sensitive to data imbalance and less sensitive to dataset size.
Conclusions Our results show that grid search significantly improves the performance of ubiquitination site prediction. We find that the performance of a method is closely related to its hyperparameter settings and the type of data it handles. Even though SVM is on average the top performer, no single method provides the best performance on all datasets. When sufficient computing resources are available, grid search is an effective way to identify both a top-performing model for a given machine learning method and a suitable machine learning method for a particular dataset.
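The grid construction described in the Method paragraph can be illustrated with a minimal sketch. It only shows how per-hyperparameter value sets combine into a grid and how the number of trained models scales with the grid size and the number of cross-validation folds. The hyperparameter names and values, the sample count, and the helper functions `build_grid` and `k_fold_indices` are hypothetical and are not the settings or code used in the study.

```python
from itertools import product

# Hypothetical value sets for an SVM-like method (illustrative assumptions only).
param_values = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
    "kernel": ["rbf", "linear"],
}

def build_grid(values):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(values)
    return [dict(zip(names, combo))
            for combo in product(*(values[name] for name in names))]

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds

# Assumed dataset size; 20% is held out before training for independent validation.
n_samples = 1000
n_holdout = n_samples // 5
n_train = n_samples - n_holdout

grid = build_grid(param_values)
folds = k_fold_indices(n_train, 5)

# One model is fit per (hyperparameter setting, fold) pair.
n_models = len(grid) * len(folds)
print(f"{len(grid)} settings x {len(folds)} folds = {n_models} model fits")
```

Because the grid size is the product of the sizes of the individual value sets, adding one more hyperparameter or a few more candidate values multiplies the number of model fits, which is consistent with the long running times reported for SVM and KNN.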