In this section, we evaluate our proposed FS method. We first summarize the key setups used to ensure the proper execution and construction of all experiments: the specifications of the datasets used for model evaluation, the tools employed in the experiments, and the performance measurements applied across all experiments. We then present the main evaluation results. In particular, we discuss comparisons between our approach and different wrapper-based and filter-based FS approaches, a statistical analysis based on the Friedman test, convergence curves, and box plots. Together, these elements provide a comprehensive view of our experimental process for evaluating the proposed FS method and of the main findings.
5.4. Evaluation Measurement
To evaluate the performance of the MBHO, we utilize several metrics, such as the duration of the FS process, the classification accuracy, and the F1 score. These metrics, except for the duration of the FS process, are based on the confusion matrix, a crucial tool for gauging the effectiveness of machine learning algorithms. The confusion matrix consists of four main components: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). In a scenario where we have a group of individuals, some of whom are infected with a specific disease, TP refers to those who are indeed infected and correctly diagnosed, TN to those who are not infected and correctly identified as healthy, FP to those who are healthy but mistakenly diagnosed as infected, and FN to those who are infected but incorrectly identified as healthy. This classification task helps us understand the significance of these four components in the context of machine learning evaluation. Our main metrics are given as follows:
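For concreteness, the two confusion-matrix-based metrics can be computed as in the following minimal sketch; the counts are illustrative only, not taken from our experiments:

```python
# Minimal sketch: accuracy and F1 score computed from the four
# confusion-matrix components (TP, TN, FP, FN); counts are illustrative.
def accuracy(tp, tn, fp, fn):
    # fraction of all cases classified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    # harmonic mean of precision (TP / (TP + FP)) and recall (TP / (TP + FN))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 40 infected correctly diagnosed (TP), 45 healthy correctly
# identified (TN), 5 healthy flagged as infected (FP), 10 infected missed (FN)
print(accuracy(40, 45, 5, 10))        # → 0.85
print(round(f1_score(40, 5, 10), 3))  # → 0.842
```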
Statistical Significance: We employ the Wilcoxon signed-rank test, a non-parametric statistical method. This test is used to compare two paired observations (here, FS algorithms) across multiple cases (here, datasets). The goal is to determine whether there is a significant difference in the median values of these two observations [144]. The test uses a p-value and a null hypothesis to ascertain this difference. The null hypothesis, which assumes no significant difference between the two observations, is rejected only if the p-value falls below a specified significance level α. Here, we consider the p-value as the main parameter for this statistical significance test. The Friedman test, as a non-parametric test, operates on non-uniformly distributed values, thus bypassing the need for normality assumptions in decision-making. Its purpose is to determine whether there are significant differences among at least three measurements within the same group of subjects concerning a skewed variable of interest, such as accuracy [145].
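As an illustration, both tests can be run with SciPy on per-dataset accuracy values. The numbers below are hypothetical stand-ins, not the results reported in our tables:

```python
# Illustrative use of the Wilcoxon signed-rank and Friedman tests;
# the accuracy values below are hypothetical, not the paper's results.
from scipy.stats import wilcoxon, friedmanchisquare

# paired per-dataset accuracies for two FS methods (one value per dataset)
method_a = [0.86, 0.99, 0.94, 0.99, 0.97, 0.93, 0.88,
            0.95, 0.91, 0.90, 0.85, 0.89, 0.92, 0.87]
method_b = [0.81, 0.99, 0.94, 0.98, 0.95, 0.90, 0.88,
            0.93, 0.90, 0.88, 0.84, 0.86, 0.90, 0.85]

# Wilcoxon signed-rank test on the paired observations; the null hypothesis
# of no difference is rejected when the p-value falls below alpha
alpha = 0.05
stat, p = wilcoxon(method_a, method_b)
print("Wilcoxon:", "significant" if p < alpha else "not significant")

# the Friedman test needs at least three related measurements per dataset
method_c = [0.80, 0.95, 0.90, 0.96, 0.93, 0.88, 0.85,
            0.90, 0.88, 0.85, 0.82, 0.84, 0.88, 0.83]
fstat, fp = friedmanchisquare(method_a, method_b, method_c)
print("Friedman:", "significant" if fp < alpha else "not significant")
```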
5.5. Evaluation Results
Having described the main experimental setup and evaluation measurements, we now present our main evaluation results and the related discussions.
Evaluation of the Case with No Correlation: Table 2 presents a comparison between MBHO (with no correlation) and BCS (with no correlation), evaluated across various datasets. Each dataset is represented by a row, and the columns detail different performance metrics and characteristics of the methods: the name of the dataset, the classification accuracy, the F1 score, the time taken for each algorithm to execute, and the number of features used by each one. By comparing these metrics, one can evaluate the performance of the two FS algorithms across different datasets and conditions, such as which method has higher accuracy or F1 score, which runs faster, or which uses fewer features. This can assist in selecting the most suitable method for a specific task or dataset. Note that time is formatted here as hours:minutes:seconds:milliseconds, and the accuracy and F1 scores are proportions from 0 to 1. The features column reports the number of variables used by the model to make its predictions; a lower number indicates a less complex model, which can sometimes lead to better performance and interpretability. Moreover, according to the Wilcoxon test, these two algorithms are significantly different from each other, as the resulting p-value falls below the significance level α.
In this comparison between the MBHO (with no correlation) and BCS (with no correlation) algorithms over various datasets, MBHO generally outperforms BCS in terms of accuracy and F1 score, but it also tends to use more features and take longer to run. For instance, on the Colon and Darwin datasets, MBHO (with no correlation) achieves higher accuracy and F1 scores than BCS (with no correlation), despite using slightly more features and taking longer. On the Leukemia dataset, both methods perform equally in terms of accuracy and F1 score, but MBHO (with no correlation) takes longer and uses fewer features. Similar trends are observed on the Leukemia-3c and MLL datasets. On the WDBC, Sobar, and Divorce datasets, both methods perform equally in terms of accuracy and F1 score, but MBHO (with no correlation) takes longer and uses fewer features. For the SRBCT, Parkinsons, Sonar, and Urban datasets, MBHO achieves higher accuracy and F1 scores but takes longer and uses more features. On the SpectTF and WPBC datasets, both methods have the same accuracy, but MBHO (with no correlation) has a slightly higher F1 score, takes longer, and uses more features. In total, MBHO (with no correlation) yields better accuracy than BCS (with no correlation) in eight of the fourteen datasets and a better F1 score in eight of them. While MBHO (with no correlation) may provide better performance, it may also be more computationally intensive and complex. Conversely, BCS (with no correlation) may be a more efficient choice in terms of runtime, but it may not match MBHO in accuracy and F1 score.
Evaluation of Incorporating Proposed Correlation: Table 3 presents a comparison between MBHO (with correlation) and Binary CS (BCS) (with correlation) [38], based on the Spearman correlation function and a significance level α. The metrics used for comparison are Accuracy (Acc), F1 score (F1), Time, and the number of Features. For the ‘Colon’ dataset, MBHO (with correlation) achieved an accuracy of 0.86 and an F1 score of 0.83 in 19:16:256 time with 1064 features, while BCS achieved an accuracy of 0.81 and an F1 score of 0.77 in 9:56:465 time with 1069 features. In the ‘Leukemia’ dataset, both MBHO and BCS achieved an accuracy of 0.99 and an F1 score of 0.94, but MBHO took more time (54:23:664) and used more features (1732) than BCS (31:49:948 time and 1730 features). For the ‘MLL’ dataset, both methods achieved an accuracy of 0.94 and an F1 score of 0.87, but MBHO took more time (14:32:17:233) and used more features (6500) than BCS (11:37:18:083 time and 6223 features). Lastly, for the ‘Divorce’ dataset, MBHO achieved an accuracy of 0.99 and an F1 score of 0.99 in 1:00:577 time with 14 features, while BCS achieved an accuracy of 0.98 and an F1 score of 0.98 in 35:622 time with 18 features. Moreover, according to the Wilcoxon test, these two algorithms are significantly different from each other, as the resulting p-value falls below the significance level α. In total, MBHO (with correlation) yields better accuracy in nine of the fourteen datasets and better F1 scores in nine datasets. This shows the enhancement due to the correlation terms proposed in this work to improve the fitness function used in the FS process.
Table 4 presents a comparison between two versions of the MBHO algorithm, categorized by their correlation status, detailing their performance across various datasets. The MBHO-correlation variant often outperforms the MBHO-no correlation counterpart in terms of accuracy and F1 scores across several datasets. Notably, MBHO-correlation achieves a higher accuracy and F1 score in eight and five datasets, respectively, compared to MBHO-no correlation. However, for the SRBCT dataset, MBHO-no correlation achieves a better F1 score. In addition, both variants yield similar results in the remaining datasets for both accuracy and F1 scores, further emphasizing the effectiveness of incorporating correlation into the MBHO algorithm. For example, in the Leukemia dataset, MBHO-correlation achieves an accuracy of 99% and an F1 score of 94%, albeit with an execution time of 54 minutes and 23 seconds compared to MBHO-no correlation’s 9 minutes and 18 seconds. Similarly, in the MLL dataset, MBHO-correlation achieves higher accuracy and F1 scores but requires more execution time than MBHO-no correlation. These results suggest that considering correlation leads to more robust FS, enhancing classification performance at the cost of more computational time. Conversely, in datasets with a small number of features, such as WPBC, both variants yield similar performance metrics and select the same number of features. Each variant has its benefit depending on the user’s need: MBHO-no correlation for better runtime and MBHO-correlation for better accuracy.
Evaluation of Comparing MBHO with Filter-based FS: Table 5 presents the performance comparison between MBHO with the Spearman correlation function and three baseline filter FS approaches: MIM [39], JMI [40], and mRMR [41]. Each method’s accuracy and F1 score are evaluated across fourteen datasets. The results show that MBHO with the Spearman correlation function generally achieves higher accuracy compared to the other methods. Specifically, in datasets such as Colon, Darwin, Leukemia, SRBCT, and Divorce, MBHO consistently outperforms the other methods in terms of both accuracy and F1 score. Additionally, the ranking analysis indicates that MBHO with the Spearman correlation function has the highest sum of ranks, implying its overall superiority across the evaluated datasets. However, it is worth noting that MBHO may require more time to generate the final subset of features compared to the other methods, since it is an approximation approach. Overall, these findings suggest that MBHO with the Spearman correlation function is a promising approach for feature selection tasks in different applications (captured by different datasets), offering competitive performance across diverse datasets.
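To make the filter-based baselines concrete, the MIM criterion (rank features by their mutual information with the class label and keep the top-k) can be sketched as follows. This is a simplified pure-Python illustration for small discrete toy data, not the implementation used in our experiments, and the helper names (`mutual_information`, `mim_select`) are ours for illustration:

```python
# Sketch of the MIM filter idea: score each feature by its mutual
# information with the class label and keep the top-k features.
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    # empirical mutual information I(X; Y) for discrete values, in bits
    n = len(labels)
    pxy = Counter(zip(feature, labels))
    px, py = Counter(feature), Counter(labels)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mim_select(feature_columns, labels, k):
    # rank features by MI with the label and return the indices of the top-k
    scores = [(mutual_information(col, labels), i)
              for i, col in enumerate(feature_columns)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# toy data: feature 0 matches the label perfectly, feature 1 only weakly
labels = [0, 0, 1, 1, 0, 1]
feature_columns = [[0, 0, 1, 1, 0, 1],   # informative
                   [1, 0, 1, 0, 0, 1]]   # weakly informative
print(mim_select(feature_columns, labels, k=1))  # → [0]
```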
Evaluation of Convergence Speed of Proposed Approach: In this study, we analyzed the behavior of the convergence curves generated by the compared algorithms: our method variants (MBHO-no correlation and MBHO-correlation), along with the BCS variants. The experiments were conducted based on the average fitness score of each algorithm over a series of iterations (ranging from 1 to 25). Figure 4 and Figure 5 illustrate the convergence curves on the provided datasets, where the X-axis denotes the iteration count and the Y-axis signifies the fitness value for each algorithm. According to Figure 4 and Figure 5, the proposed MBHO model converges faster than the other algorithms due to its emphasis on maximizing the fitness score. Furthermore, the performance of the proposed algorithm improves at each iteration (as shown in the figures by its tendency to find solutions with higher fitness values). We emphasize that the mutation operation and the correlation assessment have improved the performance (accuracy and F1) of the proposed algorithm by providing opportunities to explore the search space more extensively and by replacing existing solutions with better ones, but they lead to a slower overall runtime compared to BCS.
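The way such convergence curves are produced can be sketched with a toy iterative search standing in for the actual optimizers; the fitness function, mutation step, and parameters below are illustrative assumptions, not our MBHO implementation:

```python
# Illustrative sketch: run a simple binary FS search, record the best fitness
# at each of 25 iterations, and the recorded values form the convergence curve
# (iteration count on the X-axis, fitness on the Y-axis).
import random

def fitness(mask):
    # toy fitness: reward selecting the first three "relevant" features,
    # with a small penalty on subset size (a stand-in for an accuracy-based
    # fitness that also favors fewer features)
    return sum(mask[:3]) - 0.05 * sum(mask)

def run_search(n_features=10, iterations=25, seed=0):
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n_features)]
    curve = []
    for _ in range(iterations):
        cand = best[:]
        cand[rng.randrange(n_features)] ^= 1   # flip one bit (mutation)
        if fitness(cand) >= fitness(best):     # keep the better solution
            best = cand
        curve.append(fitness(best))            # value plotted on the Y-axis
    return curve

curve = run_search()
# a best-so-far curve never decreases across iterations
assert all(a <= b for a, b in zip(curve, curve[1:]))
```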
Evaluation of Distribution of Fitness Scores Across Iterations (Box Plots): In our research, we also examined the performance of the MBHO algorithm using box plots, which visualize the distribution of fitness scores across different iterations. They illustrate changes in the fitness scores over time, displaying the spread of values and the central tendency, and provide important insights into the convergence behavior of the algorithm. The tests were carried out based on the mean fitness score of each algorithm across a range of iterations (from 1 to 25). The box plots are depicted in Figure 6 and Figure 7, which use the provided datasets. The X-axis represents the number of iterations, while the Y-axis indicates the fitness value for each algorithm. To provide a clear analysis, we first introduce the key elements of a box plot: the box depicts the interquartile range, the line inside the box is the median of the average fitness score, and the lines extending from the box (the whiskers) indicate the variability outside the upper and lower quartiles. According to Figure 6, MBHO with Spearman generally achieves the highest fitness scores across six of the eight datasets shown in that figure, although it exhibits considerable variability on certain datasets, suggesting fluctuating performance across different problem instances. Furthermore, according to Figure 7, MBHO with Spearman tends to have the highest fitness scores across four of the remaining six datasets, with some variability observed. This variability indicates fluctuations in performance across different datasets, highlighting the importance of considering the specific characteristics of each problem instance. Yet it also shows that even when MBHO-correlation starts with low-quality solutions, it tends, through iterative evolution, to reach better areas of the search space over time. This highlights the adaptive and evolutionary nature of the algorithm, which can progressively improve its performance through iterations even when faced with initially suboptimal solutions. Despite this fluctuating performance, the proposed MBHO approach achieves the highest fitness score and accuracy across most of the datasets used.
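The box-plot summary statistics described above (median, interquartile range, whiskers) can be reproduced from a list of fitness scores with only the Python standard library; the scores below are hypothetical:

```python
# Sketch: the summary statistics a box plot displays, computed from
# hypothetical per-iteration average fitness scores.
import statistics

fitness_scores = [0.71, 0.74, 0.78, 0.80, 0.81, 0.83, 0.84, 0.86, 0.88, 0.90]

q1, med, q3 = statistics.quantiles(fitness_scores, n=4)  # quartiles
iqr = q3 - q1                                            # box height (IQR)
# Tukey-style whiskers: furthest data points within 1.5 * IQR of the box
lo_whisker = min(v for v in fitness_scores if v >= q1 - 1.5 * iqr)
hi_whisker = max(v for v in fitness_scores if v <= q3 + 1.5 * iqr)
print(med, iqr, lo_whisker, hi_whisker)
```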