Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance

Version 1 : Received: 9 September 2024 / Approved: 9 September 2024 / Online: 9 September 2024 (11:21:00 CEST)

How to cite: Safi, S. K.; Gul, S. An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance. Preprints 2024, 2024090681. https://doi.org/10.20944/preprints202409.0681.v1

Abstract

Machine learning methods used for classification can face challenges due to class imbalance, where one class is underrepresented. Over/under-sampling of minority/majority class observations, or model selection for ensemble methods alone, may not be effective when the class imbalance ratio is very high. To address this issue, this paper proposes a method called enhanced tree ensemble (ETE), which generates synthetic data for minority class observations in conjunction with tree selection based on performance on the training data. The proposed method first generates minority class instances to balance the training data and then selects trees based on their performance on out-of-bag (ETE_OOB) or sub-sampled (ETE_SS) observations. The efficacy of the proposed method is assessed on twenty benchmark binary classification problems with moderate to extreme class imbalance, comparing it against other well-known methods such as the optimal tree ensemble (OTE), SMOTE random forest (RF_SMOTE), over-sampling random forest (RF_OS), under-sampling random forest (RF_US), k-nearest-neighbour (k-NN), support vector machine (SVM), classification tree, and artificial neural network (ANN). Performance metrics such as classification error rate and precision are used for evaluation. The analyses reveal that the proposed method, based on data balancing and model selection, yields better results than the other methods.
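The two-stage idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a simple SMOTE-style interpolation for the minority class and a held-out validation subsample as a stand-in for the paper's OOB-based tree selection; the sample sizes, tree counts, and the `smote_like_oversample` helper are all assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority neighbours
    (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), n_new)
    neigh = idx[base, rng.integers(1, k + 1, n_new)]  # skip self (column 0)
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Imbalanced toy problem (roughly 5% minority class).
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stage 1: balance the training data with synthetic minority samples.
X_min = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote_like_oversample(X_min, n_new, seed=0)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

# Stage 2: grow a forest, then keep only the trees that score best on a
# held-out subsample (a proxy for the paper's OOB / sub-sample selection).
X_fit, X_val, y_fit, y_val = train_test_split(X_bal, y_bal, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_fit, y_fit)
scores = [accuracy_score(y_val, t.predict(X_val)) for t in forest.estimators_]
best = np.argsort(scores)[-50:]  # retain the top 50 trees

# Majority vote over the selected trees only.
votes = np.mean([forest.estimators_[i].predict(X_te) for i in best], axis=0)
pred = (votes >= 0.5).astype(int)
print("selected-tree ensemble accuracy:", accuracy_score(y_te, pred))
```

In this sketch the selection criterion is plain validation accuracy; the paper evaluates error rate and precision, and replacing `accuracy_score` in the selection step with a precision-oriented metric follows the same pattern.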

Keywords

Random Forest; Tree Selection; Classification; Class-Imbalance Problem; Synthetic Data Generation

Subject

Computer Science and Mathematics, Probability and Statistics

