2.2. The Virtual Screening Program.
The following machine learning models can be used for a virtual screening program (all five are available in scikit-learn; the setup sketch after this list shows the corresponding imports):
1. Linear Regression. A machine learning method that fits a linear model to predict numerical values: it finds a linear relationship between the input features and the output variable, which can then be used to make predictions for new data.
2. Support Vector Machine Regression. A machine learning method for regression tasks. It fits a hyperplane in the input space that approximates the training data within a tolerance margin; the output value for new input data is then read off this hyperplane.
3. Random Forest Regression. A machine learning method for regression tasks. It builds an ensemble of randomized decision trees, each of which predicts the value of the output variable; the predictions of the individual trees are then averaged to obtain the final result.
4. Gradient Boosting Regression. A machine learning method for regression tasks. It builds models sequentially, each new model correcting the errors of the previous ones; the predictions of all models are then combined to produce the final result.
5. K-Nearest Neighbors Regression. The K-Nearest Neighbors (KNN) algorithm is used for both classification and regression. For a query point, it finds the k closest points (neighbors) in the feature space and uses their values to predict the output variable.
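All five algorithms are implemented in scikit-learn. A minimal setup sketch (assuming a standard scikit-learn installation) with the imports used throughout this section:
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor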
Feature scaling
Before building the models, we need to scale the features. While linear regression and random forest do not require this procedure, distance-based methods such as k-nearest neighbors (and kernel methods such as support vector regression) do, as they rely on the calculation of Euclidean distances.
For our purposes, we will use the MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))
# Fit on the training data
scaler.fit(train_features)
# Transform both the training and testing data
train_features = scaler.transform(train_features)
test_features = scaler.transform(test_features)
Evaluation of different models
We will prepare auxiliary functions for training the models and evaluating them:
from sklearn.metrics import mean_absolute_error

# Function to calculate mean absolute error
def mae(true_labels, predicted_labels):
    return mean_absolute_error(true_labels, predicted_labels)
# Takes in a model, trains the model, and evaluates the model on the test set
def fit_and_evaluate(model, train_features, train_labels, test_features, test_labels):
    # Train the model
    model.fit(train_features, train_labels)
    # Make predictions and evaluate
    model_test_pred = model.predict(test_features)
    model_train_pred = model.predict(train_features)
    model_test_mae = mae(test_labels, model_test_pred)
    model_train_mae = mae(train_labels, model_train_pred)
    # Return the performance metrics
    return model_test_mae, model_train_mae
2.2.1. Linear Regression Model
This model yields an error (MAE) of 20.01% on the test set and 16.10% on the training set:
lr = LinearRegression()
lr_mae_test, lr_mae_train = fit_and_evaluate(lr, train_features, train_labels, test_features, test_labels)
print('Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae_test)
print('Linear Regression Performance on the train set: MAE = %0.4f' % lr_mae_train)
Linear Regression Performance on the test set: MAE = 20.0076
Linear Regression Performance on the train set: MAE = 16.0950
2.2.2. Support Vector Machine Regression Model
This model yields an error of 18.97% on the test set and 17.48% on the training set:
svm = SVR()
svm_mae_test, svm_mae_train = fit_and_evaluate(svm, train_features, train_labels, test_features, test_labels)
print('Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae_test)
print('Support Vector Machine Regression Performance on the train set: MAE = %0.4f' % svm_mae_train)
Support Vector Machine Regression Performance on the test set: MAE = 18.9718
Support Vector Machine Regression Performance on the train set: MAE = 17.4767
2.2.3. Random Forest Model
This model yields an error of 23.73% on the test set and 6.94% on the training set:
random_forest = RandomForestRegressor(random_state=0)
random_forest_mae_test, random_forest_mae_train = fit_and_evaluate(random_forest, train_features, train_labels, test_features, test_labels)
print('Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae_test)
print('Random Forest Regression Performance on the train set: MAE = %0.4f' % random_forest_mae_train)
Random Forest Regression Performance on the test set: MAE = 23.7254
Random Forest Regression Performance on the train set: MAE = 6.9407
2.2.4. Gradient Boosting Model
This model yields an error of 21.98% on the test set and 3.60% on the training set:
gradient_boosted = GradientBoostingRegressor(random_state=0)
gradient_boosted_mae_test, gradient_boosted_mae_train = fit_and_evaluate(gradient_boosted, train_features, train_labels, test_features, test_labels)
print('Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae_test)
print('Gradient Boosted Regression Performance on the train set: MAE = %0.4f' % gradient_boosted_mae_train)
Gradient Boosted Regression Performance on the test set: MAE = 21.9783
Gradient Boosted Regression Performance on the train set: MAE = 3.5952
2.2.5. K-Nearest Neighbors Model
This model yields an error of 18.79% on the test set and 14.22% on the training set:
knn = KNeighborsRegressor()
knn_mae_test, knn_mae_train = fit_and_evaluate(knn, train_features, train_labels, test_features, test_labels)
print('K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae_test)
print('K-Nearest Neighbors Regression Performance on the train set: MAE = %0.4f' % knn_mae_train)
K-Nearest Neighbors Regression Performance on the test set: MAE = 18.7883
K-Nearest Neighbors Regression Performance on the train set: MAE = 14.2183
The graph below demonstrates that the support vector and k-nearest neighbors models show similar results on the test set and have the smallest errors (about 19%).
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use('fivethirtyeight')
# Dataframe to hold the results
model_comparison = pd.DataFrame({'model': ['Linear Regression', 'Support Vector Machine',
'Random Forest', 'Gradient Boosted',
'K-Nearest Neighbors'],
'mae_test': [lr_mae_test, svm_mae_test, random_forest_mae_test,
gradient_boosted_mae_test, knn_mae_test],
'mae_diff': [lr_mae_test - lr_mae_train, svm_mae_test - svm_mae_train, random_forest_mae_test - random_forest_mae_train,
gradient_boosted_mae_test - gradient_boosted_mae_train, knn_mae_test - knn_mae_train]
})
# Horizontal bar chart of test mae
model_comparison.sort_values('mae_test', ascending = False).plot(x = 'model', y = 'mae_test', kind = 'barh',
color = 'red', edgecolor = 'black')
# Plot formatting
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Test MAE', size = 20);
Upon comparing the error differences between the training and test sets, we observe that the support vector model exhibits the greatest generalization ability, while both the random forest and gradient boosting models display clear signs of overfitting (Figure 1 and Figure 2).
# Horizontal bar chart of the difference between test and train MAE
model_comparison.sort_values('mae_test', ascending = False).plot(x = 'model', y = 'mae_diff', kind = 'barh',
                                                                 color = 'blue', edgecolor = 'black')
# Plot formatting
plt.ylabel(''); plt.yticks(size = 14); plt.xlabel('Mean Absolute Error'); plt.xticks(size = 14)
plt.title('Model Comparison on Difference between Test MAE and Train MAE', size = 20);
Following this, we proceed to optimize the models and visualize their performance:
Support vector method
The optimization of the model resulted in an enhancement of prediction accuracy on the training set. However, this improvement was accompanied by a decrease in performance on the test set (Figure 3).
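The optimization code is not reproduced here; one plausible implementation is a cross-validated grid search over the main SVR hyperparameters (the grid below is an illustrative assumption, not the values used in the study):
from sklearn.model_selection import GridSearchCV
# Hypothetical search grid: illustrative values only
param_grid = {'C': [0.1, 1, 10, 100],
              'epsilon': [0.01, 0.1, 1],
              'gamma': ['scale', 'auto']}
svm_search = GridSearchCV(SVR(), param_grid,
                          scoring='neg_mean_absolute_error', cv=5)
svm_search.fit(train_features, train_labels)
svm_opt = svm_search.best_estimator_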
Random forest model
The optimization of the model resulted in a notable enhancement of generalization ability and effectively eliminated overfitting (Figure 4). The error was 14.77% on the training set and 17.69% on the test set.
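One plausible tuning procedure for the random forest (the search space below is an illustrative assumption, not the values used in the study) is a randomized search that constrains tree depth and leaf size, which curbs the overfitting observed above:
from sklearn.model_selection import RandomizedSearchCV
# Hypothetical search space: limiting depth and leaf size reduces overfitting
param_dist = {'n_estimators': [100, 300, 500],
              'max_depth': [3, 5, 10, None],
              'min_samples_leaf': [1, 2, 5, 10],
              'max_features': ['sqrt', 0.5, 1.0]}
rf_search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                               param_dist, n_iter=20, cv=5,
                               scoring='neg_mean_absolute_error',
                               random_state=0)
rf_search.fit(train_features, train_labels)
random_forest_opt = rf_search.best_estimator_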
K-nearest neighbors model
Optimization of the model led to overfitting and a loss of generalization ability (Figure 5).
Gradient boosting model
Optimization of this model led to a significant improvement in generalization ability and eliminated overfitting (Figure 6). The error was 11.81% on the training set and 16.65% on the test set, the best result among all models.
In summary, we tested several models for solving regression problems. The models that performed best without optimization were the support vector machine (SVM) and k-nearest neighbors (KNN) models.
After optimization, the gradient boosting model demonstrated the best generalization ability, achieving a test error of 16.65%. This optimized model can be used to predict antioxidant activity based on quantum chemical parameters.
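In screening, applying the model amounts to scaling the descriptors of new candidate compounds with the scaler fitted on the training data and predicting their activity. A minimal sketch, where new_features and gradient_boosted_opt are hypothetical names for the candidates' quantum chemical descriptors and the tuned gradient boosting model:
# new_features is a hypothetical table of quantum chemical descriptors
# for candidate compounds; reuse the scaler fitted on the training set
new_features_scaled = scaler.transform(new_features)
# gradient_boosted_opt is the tuned gradient boosting model (hypothetical name)
predicted_activity = gradient_boosted_opt.predict(new_features_scaled)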
Furthermore, using the electronic topological approach, the Department of Medical and Pharmaceutical Informatics and New Technologies at ZSMPU developed a virtual screening program (Figure 7).
Figure 7. The interface of the virtual screening program.