Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

Finding Optimal Models of Random Forest and Support Vector Machine through Tuning Hyperparameters in Classifying the Imbalanced Data

Version 1 : Received: 18 August 2024 / Approved: 19 August 2024 / Online: 19 August 2024 (14:32:51 CEST)

How to cite: Handoyo, S.; Chen, Y.-P.; Wibowo, R. B. E.; Widodo, A. W. Finding Optimal Models of Random Forest and Support Vector Machine through Tuning Hyperparameters in Classifying the Imbalanced Data. Preprints 2024, 2024081318. https://doi.org/10.20944/preprints202408.1318.v1

Abstract

Imbalanced classes can cause machine learning models to classify positive-class instances poorly, and the models require hyperparameters tuned to optimal values. This study aims to develop random forest (RF) and support vector machine (SVM) models that use optimal hyperparameters obtained through 5-fold cross-validation. Both models were trained with the optimal hyperparameter pairs found under two data scenarios, the original and the oversampled training sets, to produce the benchmark and best models, respectively. Model performance was evaluated on both the training and testing data using six metrics. The optimal RF hyperparameter pair was (500, 10) for the minimum number of instances and the tree depth level, while the optimal SVM hyperparameter pair was (0.001, 500) for the gamma and constant values. The benchmark models reached approximately 98% on the Accuracy, Precision, Recall, and F1-score metrics but delivered essentially no performance on the Matthews Correlation Coefficient (MCC) and Area Under the Curve (AUC) metrics. The best RF and SVM models scored slightly lower than the benchmark models on those four common metrics, but both improved the MCC and AUC by approximately 6% and 11%, respectively. The best RF performed slightly better than the best SVM.
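The contrast the abstract draws between near-perfect Accuracy and a failing MCC can be illustrated with a small, self-contained sketch (not taken from the paper's code): on a 98:2 imbalanced split, a trivial model that predicts only the majority class still reaches 98% accuracy, while the Matthews Correlation Coefficient exposes the failure. The counts below are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Defined as 0 when any marginal sum is zero (the usual convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical 100-instance test set: 98 negatives, 2 positives.
# A model that always predicts the negative (majority) class gives:
tp, tn, fp, fn = 0, 98, 0, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)               # 0.98 -- looks excellent
print(mcc(tp, tn, fp, fn))    # 0.0  -- no real predictive power
```

This is why the study reports MCC and AUC alongside the four standard metrics: on imbalanced data, the latter can reward a model that never identifies a positive instance.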

Keywords

Area Under Curve; Cross-validation data; Matthews Correlation Coefficient; Optimal hyperparameters; Oversampling method

Subject

Computer Science and Mathematics, Logic
