2.1. Balancing the Training Data
Let $\Upsilon = (X, Y)$ be the given data, where $X$ is an $n \times p$ matrix, that is, $X = [x_{ir}]$, where $i = 1, 2, \ldots, n$ and $r = 1, 2, \ldots, p$, and $Y = (y_1, y_2, \ldots, y_n)$ is a two-class vector, that is, $y_i \in \{0, 1\}$ of length $n$. The $y_i$ are the values of the binary response variable, with a value of 1 for one class and a value of 0 for the other class. Out of the total $n$ sample points, there were $n_1$ majority and $n_0$ minority class observations, with a severely skewed class distribution, i.e., $n_0 \ll n_1$. The following procedure was considered to balance the training data before building the tree ensembles. The given dataset $\Upsilon$ was balanced by selecting $B = n_1 - n_0$ bootstrap samples, each of size $n_0$, i.e., equal to the number of observations in the minority class, from the minority class observations. Let the samples be $S_v$, where $v = 1, 2, \ldots, B$. Each bootstrap sample had $n_0$ observations of $p$ features. Each bootstrap sample was used to generate an additional row in the data by estimating the mean $\bar{x}_r$ of each column if it was numeric and the mode $\hat{x}_r$ if it was categorical. Let there be $p_1$ continuous features and $p_2$ categorical features in the bootstrap samples, with $p_1 + p_2 = p$; calculating the means and modes of the features gives a vector of $p$ values, that is, $\tilde{x}_v = (\tilde{x}_{v1}, \tilde{x}_{v2}, \ldots, \tilde{x}_{vp})$, where $r = 1, 2, \ldots, p$ indexes the features and the elements in $\tilde{x}_v$ are arranged according to the original order of the features in the bootstrap sample. The generated rows are arranged in a matrix $\tilde{X}$, and the last column comprises the class labels of the minority class, that is, $\tilde{\Upsilon} = (\tilde{X}, \tilde{Y})$, where $\tilde{x}_v$ denotes a newly generated vector. The given training data was combined with the generated data $\tilde{\Upsilon}$ to obtain balanced data, that is, $\Upsilon_B = \Upsilon \cup \tilde{\Upsilon}$.
For each class, there were equal numbers of data points in the balanced data $\Upsilon_B$. To grow optimal trees, $\Upsilon_B$ was used instead of the original data $\Upsilon$. Two methods were used to grow and select optimal tree ensembles using the balanced data. The first method used the corresponding out-of-bag (oob) observations to assess each tree individually; trees were ranked based on how well they performed individually on the oob observations. The second method involved random subsets of the training data for tree growth. In this study, the balanced data $\Upsilon_B$ was thus used in conjunction with two ensemble methods: the optimal tree ensemble classifier using out-of-bag samples, OTEC(oob), and the optimal tree ensemble classifier using sub-samples, OTEC(sub).
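The balancing procedure above can be sketched as follows. This is a minimal illustration, not the paper's R implementation: it assumes the data sit in a pandas DataFrame, and the helper name `balance_data` is hypothetical.

```python
import numpy as np
import pandas as pd

def balance_data(df, label_col="class", minority_label=0, rng=None):
    """Balance a two-class dataset by generating synthetic minority rows.

    For each of the (n1 - n0) required rows, a bootstrap sample of size n0
    is drawn from the minority class; the new row holds the column means
    (numeric features) and modes (categorical features) of that sample.
    """
    rng = np.random.default_rng(rng)
    minority = df[df[label_col] == minority_label]
    n0 = len(minority)
    n1 = len(df) - n0
    new_rows = []
    for _ in range(n1 - n0):
        # Bootstrap sample S_v of size n0 from the minority class.
        boot = minority.sample(n=n0, replace=True,
                               random_state=int(rng.integers(1 << 31)))
        row = {}
        for col in df.columns:
            if col == label_col:
                row[col] = minority_label          # new row keeps the minority label
            elif pd.api.types.is_numeric_dtype(df[col]):
                row[col] = boot[col].mean()        # mean for continuous features
            else:
                row[col] = boot[col].mode().iloc[0]  # mode for categorical features
        new_rows.append(row)
    # Combine the original data with the generated rows (the balanced data).
    return pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
```

Applied to a dataset with $n_1$ majority and $n_0$ minority observations, the returned frame contains $n_1$ rows of each class.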
2.2. Out-of-Bag based Assessment with Balanced Data
The first proposed method, OTEC(oob), used out-of-bag (oob) observations for trees grown on the balanced training data $\Upsilon_B$. Studies [28,29] have observed that bootstrap samples omit approximately 1/3 of the overall training data [30]. These observations play no role in model construction and can be used as independent validation data for model assessment. From the balanced data $\Upsilon_B$, we took $T$ bootstrap samples $B_t$, $t = 1, 2, 3, \ldots, T$, and let $O_t$ be the corresponding set of oob observations for each sample. $G(B_t)$ represents the classification tree that was grown on $B_t$. It was further assumed that $E_t$ represented the oob error, which is the error of $G(B_t)$ on $O_t$, i.e.,
$$E_t = \frac{1}{|O_t|} \sum_{(x, y) \in O_t} I\left(y \neq \hat{y}\right),$$
where $y$ is the true class label in the oob sample $O_t$, $\hat{y}$ is the corresponding value predicted by the classification tree $G(B_t)$, and $|O_t|$ is the size of the oob sample. $I(\cdot)$ is an indicator function, expressed as:
$$I\left(y \neq \hat{y}\right) = \begin{cases} 1, & \text{if } y \neq \hat{y}, \\ 0, & \text{otherwise.} \end{cases}$$
After growing the desired number of classification trees, they were arranged in ascending order of their error rates on the oob samples. Let the top-ranked, second-top-ranked, and so on trees be $G(B_{(1)}), G(B_{(2)}), \ldots, G(B_{(T)})$. A number of trees from the above-ranked trees were selected for the final ensemble. The ensemble was then used to predict the new/test data. The number of trees grown and the number of trees selected were two potential hyper-parameters for this method.
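The oob-based growing, ranking, and selection steps can be sketched as follows. This is an illustrative approximation rather than the paper's implementation: scikit-learn's `DecisionTreeClassifier` stands in for the classification trees, and the names `otec_oob` and `predict_ensemble` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def otec_oob(X, y, T=100, keep_frac=0.3, rng=None):
    """Grow T trees on bootstrap samples of the balanced data, score each
    tree on its own out-of-bag observations, and keep the best fraction."""
    rng = np.random.default_rng(rng)
    n = len(X)
    scored = []
    for t in range(T):
        idx = rng.integers(0, n, size=n)        # bootstrap sample B_t
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag set O_t
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(random_state=t).fit(X[idx], y[idx])
        err = np.mean(tree.predict(X[oob]) != y[oob])  # oob error E_t
        scored.append((err, tree))
    scored.sort(key=lambda s: s[0])             # ascending oob error
    keep = max(1, int(keep_frac * len(scored)))
    return [tree for _, tree in scored[:keep]]

def predict_ensemble(trees, X):
    """Majority vote over the selected trees (binary labels 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Here `keep_frac` plays the role of the selection hyper-parameter: only the best-ranked fraction of the $T$ grown trees enters the final ensemble.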
2.3. Sub-Sample based Assessment with Balanced Data
The second proposed method, OTEC(sub), used a sub-sample based approach in which trees were grown on sub-samples from the balanced data $\Upsilon_B$. Unlike the oob observations, the remaining observations from each sample acted as test data for evaluating the predictive performance of each corresponding tree. Let $S_t$, $t = 1, 2, 3, \ldots, T$, be a random sample of size $m < n$. Additionally, let $R_t$ represent the corresponding remaining subset of observations of size $n - m$. $G(S_t)$, where $t = 1, 2, \ldots, T$, is the classification tree built on $S_t$. It is also assumed that the error of $G(S_t)$ on $R_t$ is represented by $E_t$. For the $T$ classification trees, $R_t$, $t = 1, 2, \ldots, T$, was used to estimate $E_t$ on each tree.
Let the top-ranked, second-ranked, and so on trees be $G(S_{(1)}), G(S_{(2)}), \ldots, G(S_{(T)})$.
The remaining procedure was the same as that for OTEC(oob). This method might be useful in small-sample situations in which one wants to retain a large amount of training data to build the trees. This method can also be tuned by selecting the optimal values for the initial number of trees grown and the number of trees selected for the final ensemble. The pseudo-code of the proposed ensemble is provided in Algorithm 1.
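The sub-sample variant can be sketched analogously. Again this is only an illustration under the same assumptions as before (scikit-learn trees, hypothetical function name); the only change from the oob version is that each tree is scored on the fixed complement of its sub-sample rather than on oob observations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def otec_sub(X, y, T=100, m=None, keep_frac=0.3, rng=None):
    """Grow T trees on random sub-samples of size m < n from the balanced
    data; the remaining n - m observations score each tree."""
    rng = np.random.default_rng(rng)
    n = len(X)
    m = m if m is not None else int(0.7 * n)   # illustrative default for m
    scored = []
    for t in range(T):
        idx = rng.permutation(n)
        sub, rest = idx[:m], idx[m:]           # S_t and its complement R_t
        tree = DecisionTreeClassifier(random_state=t).fit(X[sub], y[sub])
        err = np.mean(tree.predict(X[rest]) != y[rest])  # error of G(S_t) on R_t
        scored.append((err, tree))
    scored.sort(key=lambda s: s[0])            # ascending error on R_t
    keep = max(1, int(keep_frac * len(scored)))
    return [tree for _, tree in scored[:keep]]
```

Because the held-out set $R_t$ can be made large by choosing a small $m$, each tree's error estimate rests on more points than a typical oob set, which is why this variant may suit small samples.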
This study considered extremely imbalanced classification datasets to assess the performance of the proposed methods on benchmark problems. The efficacy of the proposed methods, OTEC(oob) and OTEC(sub), was assessed on 20 benchmark datasets in comparison with state-of-the-art methods: the optimal tree ensemble (OTE), random forest with the synthetic minority over-sampling technique (RF(smote)), random forest combined with over-sampling (RF(over)), random forest coupled with under-sampling (RF(under)), support vector machine (SVM), k-nearest neighbors (k-NN), artificial neural network (ANN), and classification tree (Tree). The evaluation metrics considered were the classification error rate, sensitivity, specificity, precision, recall, and the F1 score, a measure that combines precision and recall into a single value, providing a balanced assessment of a model's performance. The R programming language was used to perform the experiments.
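For concreteness, the evaluation metrics can be computed from the binary confusion matrix as follows; `binary_metrics` is an illustrative helper, with class 1 taken as the positive (minority) class.

```python
def binary_metrics(y_true, y_pred):
    """Error rate, sensitivity (recall), specificity, precision, and F1
    for a binary problem where class 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "error": (fp + fn) / n,                               # misclassification rate
        "sensitivity": recall,                                # TP / (TP + FN)
        "specificity": tn / (tn + fp) if tn + fp else 0.0,    # TN / (TN + FP)
        "precision": precision,                               # TP / (TP + FP)
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),               # harmonic mean
    }
```

On imbalanced data the error rate alone can be misleading, which is why the class-conditional measures (sensitivity, specificity, precision, F1) are reported alongside it.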
Table 1 provides a concise summary of the datasets considered. The data name is displayed in the second column, and the numbers of instances/observations ($n$) and features ($p$) are shown in the third and fourth columns, respectively. The fifth column provides the class-wise distribution, the sixth column provides the imbalance ratio ($n_1/n_0$), and the final column contains the data source. Additional information on the remaining datasets is provided in
Table S1. This section describes the experimental design of this study.
The proposed methods, OTEC(oob) and OTEC(sub), were applied to extremely imbalanced datasets, with 95% of the observations belonging to the majority class and 5% to the minority class, i.e., a 19:1 ratio. For datasets whose original class distribution did not have this 19:1 ratio, minority class observations were randomly discarded to obtain it. A total of 1,000 realizations were made from the resulting data, each split into 90% training and 10% testing parts. Model fitting was performed on the training part of the data, and evaluation was performed on the testing part. To generate $T = 1{,}000$ classification trees, bootstrap samples or sub-samples were drawn from the 90% training part using the OTEC(oob) and OTEC(sub) methods described in Algorithm 1. The testing part was used for external validation.
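The downsampling and splitting steps of the experimental design can be sketched as follows. This is an assumption-laden illustration: class 1 is taken as the majority and class 0 as the minority, the minority class is assumed to contain at least $n_1/19$ observations, and both helper names are hypothetical.

```python
import numpy as np

def make_19_to_1(X, y, rng=None):
    """Randomly discard minority observations (label 0 assumed) until the
    majority:minority ratio is 19:1, i.e., a 95%/5% split."""
    rng = np.random.default_rng(rng)
    maj = np.flatnonzero(y == 1)
    mino = np.flatnonzero(y == 0)
    # Keep only n1 // 19 minority points (assumes the minority has at least that many).
    keep = rng.choice(mino, size=max(1, len(maj) // 19), replace=False)
    idx = rng.permutation(np.concatenate([maj, keep]))
    return X[idx], y[idx]

def train_test_split_90_10(X, y, rng=None):
    """One of the 1,000 random 90%/10% train/test realizations."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(X))
    cut = int(0.9 * len(X))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```

In the study's design, balancing (Section 2.1) is then applied only to the 90% training part, and the untouched 10% serves as external validation.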
The final ensemble contained the selected top-ranked trees. The final result was the average over all 1,000 runs. The training and testing parts were the same for all methods under consideration. The findings presented in
Table 2 and
Table 3, along with the box plots displayed in
Figure 1,
Figure 2,
Figure 3 and
Figure 4, show the results of the proposed methods, OTEC(oob) and OTEC(sub), in terms of the classification error rate and precision compared with the other methods. The results of the proposed methods for the remaining metrics, that is, sensitivity, specificity, recall, and F1 score, are given in
Tables S2–S5 of
Supplementary Materials. The plots are shown in
Figures S1–S14 in
Supplementary Materials.
Table 2 compares the classification error rates of the proposed methods on the datasets with those of the other state-of-the-art methods. The OTEC(oob) results are shown in bold, and the OTEC(sub) results are italicized. It is evident from the table that the proposed methods, OTEC(oob) and OTEC(sub), outperform all other procedures in terms of the classification error rate, achieving the lowest error rates, with minimal values ranging from 0 to 0.0005. OTEC(sub) did not perform well on the breast cancer, drug classification, glass classification, and KDD datasets. By contrast, RF(smote) yielded a low error rate on the glass classification dataset. RF(under), k-NN, SVM, and ANN did not perform well on any of the datasets. In terms of the classification error rate, OTEC(oob) performed better than the other methods on 19 out of 20 datasets, and OTEC(sub) on 17 of the 20 datasets. The other methods failed to yield satisfactory results.
Table 3 presents a detailed comparison of the proposed methods with the other state-of-the-art methods on the datasets in terms of precision. OTEC(oob) yielded better precision for the majority of datasets. The results given in Table 3 show that OTEC(oob) yields promising precision values ranging from 93.36% to 100% for several datasets, demonstrating the effectiveness of the proposed method. OTEC(sub) provided better precision for 14 datasets, the exceptions including the breast cancer, liver disorder, and ionosphere datasets. Moreover, RF(smote) and RF(over) demonstrated high precision on the kc2 and glass classification datasets. The box plots comparing the proposed methods, OTEC(oob) and OTEC(sub), with the other methods in terms of the classification error rate and precision are given in
Figure 1,
Figure 2,
Figure 3 and
Figure 4. A box plot of the remaining datasets in terms of the classification error rate and precision is shown in
Figures S1 and S2 of
Supplementary Materials.
The parameter $W$, which represents the percentage of the best-ranked trees selected for the final ensemble, is crucial for the OTEC(oob) and OTEC(sub) methods, and different values of $W$ produce different behaviors. This study assessed the impact of $W = 10\%, 30\%, 50\%, 70\%, 90\%$ of the total number of trees on the classification methods. As shown in
Figure 5, subsets of the total number of trees ranging from 10% to 90% were used to calculate the classification error rate. For each dataset,
Figure 5 shows that an increase in the number of trees used by the ensemble decreased the OTEC(oob) classification error rate. Similar to OTEC(oob),
Figure 6 shows that the classification error rate of OTEC(sub) decreased with an increase in the number of trees in the ensemble. In
Figure 5 and
Figure 6, the x-axis represents the percentage of the best-ranked trees selected, while the y-axis represents the classification error rate.