3.1. Image Properties and Their Distribution across the Datasets
Image sizes and formats within the plant disease datasets used in this research exhibit diverse properties that can affect deep learning applications. These properties are examined in detail in the following subsections:
3.1.1. Data Imbalance
In order to create high-quality datasets for the detection of plant diseases, data balancing is crucial. It comprises making certain that a roughly equal fraction of each data class or category is represented. This holds particular importance in scenarios where certain classes, such as particular plant diseases, may be less prevalent than others. Data balance is important because it allows the model to learn equally and fairly from every class, making the detection system more dependable and robust [39].
The performance of a plant disease identification model may suffer from an imbalanced dataset, in which certain classes contain far fewer samples than others, in several ways [40]. Biased training is one such outcome. The data imbalance across the datasets, with the respective annotations assigned in this study, is summarized in general form in Table 1; a more specific analysis is also presented for each dataset.
Table 1.
General dataset.
Annotation | Dataset name | Total Images | Image Resolutions | Setting | Disease Classes | Plants involved
dataset_1 | plant_village dataset | 54,303 | 256x256 throughout | Lab | 38 | Apple, Cherry, Corn, Grape, Peach, Pepper, Potato, Strawberry, and Tomato
dataset_2 | a_database_of_leaf_images | 4,503 | 6000x4000 throughout | Lab | 22 | Mango, Arjun, Alstonia Scholaris, Guava, Bael, Jamun, Jatropha, Pongamia Pinnata, Basil, Pomegranate, Lemon, and Chinar
dataset_3 | RoCoLe dataset | 1,560 | 2048x1152 (768 images), 1280x720 (479 images), 4128x2322 (313 images) | Field | 5 | Coffee
dataset_4 | FGVCx_cassava dataset | 537 | Variable, from 213x231 to 960x540 | Field | 5 | Cassava
dataset_5 | paddy_doctor dataset | 16,225 | 1080x1440 (16,219 images) and 1440x1080 (6 images) | Field | 13 | Paddy
Figure 1.
Data size disparity in PlantVillage Dataset (dataset_1).
Across its classes, dataset_1 is significantly unbalanced, with differing image counts and sizes. For example, 'tomato_yellow_leaf_curl_virus' has a greater number of images within the 'tomato' subfolders than other classes such as 'bacterial_spot' or 'leaf_mold'. In addition, some subfolders, such as 'cedar_apple_rust' and 'background_without_leaves', occasionally contained fewer photos than subfolders dedicated to certain diseases.
A class imbalance is also observed in dataset_2 between different plant species and their health statuses. For example, there are substantially different numbers of diseased and healthy pictures in classes like "arjun" and "alstonia_scholaris." Moreover, the 'diseased' subfolders typically contain more photos than their 'healthy' counterparts in nearly every class.
The 'coffee' dataset in dataset_3 includes pictures of coffee plants in various states of health (healthy, red spider mite, rust levels 1-4). There is a notable class disparity between the subfolders: 'healthy' has the greatest number of images, followed by 'rust_level_1' and 'red_spider_mite'. As the rust level increases, there are progressively fewer images for 'rust_level_2', 'rust_level_3', and 'rust_level_4'.
The image data imbalance in dataset_4 is also significant: 'Healthy' is the class with the most images, followed by 'Brown_streak_disease', 'Green_mite', 'Mosaic_disease', and 'bacterial_blight', each containing fewer images. Within a few disease classes, certain resolutions also appear more prevalent than others (Figure 3).
Dataset_5 follows the same pattern, with 2,351 images in 'blast' but only 450 in 'bacterial_panicle_blight'. In contrast to certain disease classes, the class representing healthy paddy plants (annotated 'healthy', or 'healty' in the dataset's nomenclature style) contains a significant number of images. Further variations in the class data can be observed in Figure 4.
Due to these variations and the smaller sample sizes, classes with fewer images may produce GLCM-based features that are less reliable, which could affect model performance. Moreover, if left unaddressed during model training, imbalanced classes may result in biased models that perform well on majority classes and poorly on minority classes. Other factors that may similarly hinder the datasets' feature extraction are defined in the literature (Table 2). In the coming sections, an in-depth observation of these GLCM features is presented.
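As a quick diagnostic for the imbalance described above, images per class can be counted and the majority-to-minority ratio taken. The following is an illustrative pure-Python sketch, not part of the study's pipeline; the class names and counts are examples chosen to resemble dataset_1, not exact figures.

```python
from collections import Counter

def imbalance_report(labels):
    """Count images per class and return the majority/minority ratio.

    `labels` holds one class name per image, as would be collected by
    walking a dataset's subfolders (folder name = class name).
    """
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())

# Toy label list mimicking an imbalanced dataset (counts are illustrative).
labels = (["tomato_yellow_leaf_curl_virus"] * 5357
          + ["bacterial_spot"] * 2127
          + ["leaf_mold"] * 952)
counts, ratio = imbalance_report(labels)
print(counts.most_common(1)[0])  # majority class and its count
print(round(ratio, 2))           # majority is ~5.63x the minority
```

A ratio far above 1 flags classes that may need oversampling, augmentation, or class-weighted losses during training.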
Figure 2.
Image Size Distribution in dataset_2.
Figure 3.
Data Imbalance in FGVCx_cassava dataset (dataset_4).
Figure 4.
Image data Imbalance in paddy_doctor dataset (dataset_5).
Table 2.
Factors fostering fine-grained extraction style in plant disease detection research.
Factors | Effects | Source
External factors such as uneven lighting, extensive occlusion, and fuzzy details | Variations in the visual characteristics of affected plants | [41]
Variations in the presence of illness and the growth of a pest | Subtle differences in the characterization of the same diseases and pests in different regions, resulting in "intra-class distinctions" | [42]
Similarities in the biological morphology and lifestyles of subclasses of diseases and pests | Problem of "inter-class resemblance" | [39]
Background disturbances | Makes it harder to detect plant pests and diseases in actual agricultural settings | [43]
3.1.2. Image Resolutions
For texture analysis techniques such as the Gray-Level Co-occurrence Matrix (GLCM) to ensure consistent feature extraction, resolution consistency within a dataset is essential. Texture information can be extracted with more precision and dependability when images have consistent resolutions. Datasets 1 and 2 demonstrate such consistency: images are typically sized at 256x256 or 256x192 pixels for dataset 1 and 6000x4000 pixels for dataset 2, which improves the dependability of GLCM analysis across all of their images. Nevertheless, dataset_3 exhibits a significant range of dimensions, from 2048x1152 to 1280x720 pixels. Likewise, dataset_4 and dataset_5 contain varying resolutions, from 213x213 to 960x540 pixels and of 1080x1440 and 1440x1080 pixels, respectively. These variations in image resolution between datasets could introduce unpredictability into GLCM measurements, which could affect texture analysis outcomes.
The differences in image resolution found in datasets 3, 4, and 5 may result in inconsistent texture feature extraction, compromising the precision and dependability of GLCM analysis. Furthermore, these inconsistencies could pose problems for machine learning or deep learning models trained on these datasets, because the models may struggle to generalize across images of different resolutions. To ensure the robustness and generalizability of texture analysis results and of the ensuing machine learning models for plant disease detection, resolution discrepancies must be addressed.
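One common remedy, applied before GLCM extraction, is to resample every image to a single working resolution. The snippet below is a minimal, dependency-free Python sketch of nearest-neighbour resizing on a 2-D grayscale array; in practice a library such as PIL or OpenCV would be used, so this is illustrative only.

```python
def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize of a 2-D grayscale image (list of lists).

    A stand-in for the resampling step that would normally be done with
    PIL or OpenCV before GLCM extraction.
    """
    h, w = len(img), len(img[0])
    return [[img[i * h // new_h][j * w // new_w] for j in range(new_w)]
            for i in range(new_h)]

# A 4x4 gradient downsampled to 2x2; every image in a mixed-resolution
# dataset would be mapped to one common size in this way.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
small = resize_nearest(img, 2, 2)
print(small)  # [[0, 2], [8, 10]]
```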
3.2. Distribution of GLCM Metrics
Images are used to generate texture characteristics called GLCM (Gray-Level Co-occurrence Matrix) metrics, which describe the spatial correlations between pixel intensities. These metrics can be quite relevant for describing images, particularly plant disease image datasets, since they capture various facets of texture. The distribution of 10 GLCM metrics across the 5 datasets was captured for the purposes of this research and is presented in the figures below (Figure 5, Figure 6, Figure 7 and Figure 9). A general summary of these metrics follows:
Energy:
This represents the sum of the GLCM's squared elements. Higher energy values indicate a more uniform, orderly texture, in which a small number of pixel pairings dominates the image. It is computed using equation (1):
Energy = \sum_{i,j} p(i,j)^2   (1)
where p(i, j) is the (i, j)th entry of the normalized GLCM. While images in dataset_2 possess the highest energy level, dataset_5 has the lowest (Figure 5). Since energy varies in proportion to the uniformity of pixel values, it serves as an indicator of homogeneity in the image texture.
Contrast:
This measures the local variations existing in an image. High contrast values indicate a large difference between pixel intensities, and hence a more textured surface. It is computed using equation (2):
Contrast = \sum_{i,j} (i - j)^2 \, p(i,j)   (2)
where p(i, j) is the (i, j)th entry of the normalized GLCM.
The contrast distribution (Figure 5) indicates the lowest value for dataset_2, while dataset_1 and dataset_5 possess the equal highest levels. A roughly linear variation in contrast can further be observed from dataset_3 through dataset_5. This could prove valuable in understanding how contrast-related features change across different datasets, potentially influencing the interpretation of plant disease images or the performance of image analysis algorithms.
Figure 5.
GLCM Metrics: Energy and Contrast.
Correlation:
This indicates the linear dependency of gray levels in an image. High correlation values suggest a more linear association between pixel pairs. The correlation feature is computed using equation (3):
Correlation = \frac{\sum_{i,j} (i \cdot j) \, p(i,j) - \mu_x \mu_y}{\sigma_x \sigma_y}   (3)
where \mu_x, \mu_y, \sigma_x, and \sigma_y are the means and standard deviations of the marginal distributions P_x and P_y.
From the GLCM results distribution (Figure 6), dataset_1 has the lowest correlation, while dataset_2 and dataset_3 possess the highest.
Homogeneity:
This is an indicator that reflects the closeness of the distribution of elements in the GLCM to the GLCM diagonal. High homogeneity highlights uniformity in the image. It is given by equation (4) below:
Homogeneity = \sum_{i,j} \frac{p(i,j)}{1 + |i - j|}   (4)
Dataset_2 has the highest homogeneity, while dataset_4 shows the lowest (Figure 6).
Figure 6.
GLCM Metrics: Correlation and Homogeneity.
Angular Second Moment (ASM):
This represents the consistency or smoothness of an image. Higher ASM values indicate a more homogeneous texture. ASM is computed using equation (5):
ASM = \sum_{i,j} p(i,j)^2   (5)
In meaning, energy is closely related to ASM (in some formulations, energy is defined as the square root of ASM). Dataset_1 and dataset_2 have the highest ASM, while dataset_4 and dataset_5 show the lowest (Figure 7).
Total Variance:
This represents the variance of the GLCM, providing an overall view of the variance in the image texture. It relates to texture complexity; higher values indicate more complexity. It is computed using equation (6) below:
Variance = \sum_{i,j} (i - \mu)^2 \, p(i,j)   (6)
where \mu is the mean of the GLCM entries. The GLCM distribution results (Figure 7) indicate that dataset_1 and dataset_2 possess the highest and lowest total variance, respectively.
Figure 7.
GLCM Metrics: ASM and Total Variance.
Maximum Probability:
This represents the probability of the most frequently occurring intensity pair in the image, as in equation (7):
Maximum Probability = \max_{i,j} p(i,j)   (7)
Higher values of maximum probability indicate a dominant texture pattern within the image sets.
Difference Variance:
This measures the variance of the differences between adjacent pixel pairs, revealing alterations in intensity between neighboring pixels. Equation (8) shows the formula of difference variance:
Difference Variance = \sum_{k} (k - \mu_{x-y})^2 \, p_{x-y}(k)   (8)
where p_{x-y}(k) is the probability of an absolute gray-level difference of k between neighboring pixels.
Joint Entropy:
This reflects the amount of information or uncertainty present in the image. Higher joint entropy values indicate more randomness or less predictability in the image's texture (Figure 9).
Figure 9.
Joint entropy and Difference Variance.
Difference Entropy:
This reflects the randomness or unpredictability of the differences between adjacent pixel pairs. It is computed using equation (9) below:
Difference Entropy = -\sum_{k} p_{x-y}(k) \log p_{x-y}(k)   (9)
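To make the definitions above concrete, the sketch below computes a normalized GLCM for the horizontal neighbour offset and several of the metrics discussed (energy/ASM, contrast, homogeneity with a common 1 + |i - j| weighting, maximum probability, and joint entropy) on a tiny 4x4 test image. This is an illustrative pure-Python sketch, not the extraction code used in the study.

```python
import math
from collections import defaultdict

def glcm(img):
    """Normalized GLCM for the horizontal (0, 1) neighbour offset."""
    counts = defaultdict(int)
    total = 0
    for row in img:
        for a, b in zip(row, row[1:]):
            counts[(a, b)] += 1
            total += 1
    return {pair: n / total for pair, n in counts.items()}

def glcm_metrics(p):
    energy = sum(v * v for v in p.values())                          # energy / ASM
    contrast = sum((i - j) ** 2 * v for (i, j), v in p.items())      # contrast
    homog = sum(v / (1 + abs(i - j)) for (i, j), v in p.items())     # homogeneity
    max_p = max(p.values())                                          # max probability
    entropy = -sum(v * math.log2(v) for v in p.values())             # joint entropy
    return energy, contrast, homog, max_p, entropy

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 2, 2, 2],
       [2, 2, 3, 3]]
e, c, h, m, H = glcm_metrics(glcm(img))
print(round(e, 3), round(c, 3), round(m, 3))  # 0.167 0.583 0.25
```

Averaging such values per image and then per dataset reproduces the kind of distributions plotted in Figures 5 through 9.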
3.3. Highest-Lowest GLCM Metric’s Scorecard
Texture analysis for plant disease diagnosis relies heavily on image quality control and dataset selection criteria, as shown by the observed differences in GLCM metric scores between datasets. Although GLCM metrics were uniform and constant in certain datasets, they varied significantly in others, which can affect the validity of the texture-based characteristics derived from the images. To distinguish the performance of individual datasets, with regard to being the highest or lowest per generated metric, the scorecard presented in Table 3 was developed.
The dataset_3 images include high-resolution variations spanning from 2048x1152 to 1280x720 pixels, indicating that image resolution may have an impact on texture analysis results. While lower resolution images may result in information loss and poorer texture differentiation, higher resolution pictures may capture finer texture features, leading to more nuanced GLCM measurements.
Table 3.
GLCM Metric's scorecard.
GLCM Metrics | Highest in GLCM Metrics (dataset_x) | Lowest in GLCM Metrics (dataset_x)
Energy | 2 | 5
Contrast | 1, 5 | 2
Correlation | 2 | 1
Homogeneity | 2 | 1
Angular_Second_Moment | 1, 2 | 4, 5
Total_Variance | 5 | 2
Maximum_Probability | 1 | 4
Joint_Entropy | 5 | 2
Difference_Variance | 4 | 1, 5
Difference_Entropy | 5 | 2
Sum of scores | dataset_1: 3, dataset_2: 4, dataset_3: 0, dataset_4: 1, dataset_5: 4 | dataset_1: 3, dataset_2: 4, dataset_3: 0, dataset_4: 2, dataset_5: 3
Dataset_2 and dataset_5 recorded the most "highest" scores overall (four each), while dataset_2 also recorded the most "lowest" scores per GLCM metric.
Differences in image acquisition settings, environmental factors, and disease severity may cause the variations in GLCM metrics such as energy, contrast, and homogeneity between the respective datasets. For example, texture properties in images from the lab-based datasets 1 and 2 may be more consistent because the images were taken under controlled conditions, while the field-based datasets 3, 4, and 5 (referring to Table 1) present more variability because of natural variations in plant physiology and environmental factors. This calls for a more comprehensive analysis of why field-based datasets score lower in GLCM metrics and whether this affects machine/deep learning applications.
The observed discrepancies in GLCM measure scores among datasets bear significant consequences for the robustness and generalizability of machine learning and deep learning models that are trained on these datasets. When compared to models trained on datasets with significant variability in texture qualities, models trained on datasets with consistent GLCM metrics might perform better and be more generalizable. The capacity of the model to identify discriminative patterns linked to various disease classes may be hampered by bias and noise introduced into the feature space by inconsistent GLCM metrics between datasets. In practical applications, this can result in less-than-ideal performance and decreased dependability, especially when used in varied and ever-changing agricultural settings.
3.4. Correlation Matrix of GLCM Metrics
To ascertain the relationships between individual GLCM metrics of these datasets, a correlation matrix was generated and presented as a heatmap for easy visualization (
Figure 10). The GLCM metrics were ordered in the following series: Energy (1), Contrast (2), Correlation (3), Homogeneity (4), Angular Second Moment (5), Total Variance (6), Maximum Probability (7), Joint Entropy (8), Difference Variance (9) and Difference Entropy (10) respectively.
Figure 10.
Correlation Matrix of the derived GLCM Metrics.
A correlation matrix quantifies the relationship between variables with values between -1 and 1. Most data scientists consider this a decisive step before developing any machine learning model, as it helps determine which variables are most relevant for the model.
Strongly correlated metrics appear darker in color. The results suggest pairwise correlations between the following metrics:
- 1) Strong Correlations
A strong positive correlation was observed between Energy and Angular Second Moment: higher values of one are generally accompanied by higher values of the other. Accordingly, images with more uniform pixel values also tend to be more homogeneous. This can help machine learning algorithms identify patterns or homogeneity in textures. Moreover, understanding this correlation is key for feature selection in image processing: if either of the two metrics is highly representative, using both may not provide additional information.
A strong association was also observed between Energy and Maximum Probability: the maximum probability of pixel pairs tends to grow as the uniformity (Energy) of the image increases. This indicates that more uniform plant disease images are likely to have a pixel pair that occurs more regularly than others. For machine learning and deep learning applications, this could affect tasks where the existence of specific pixel pairs is essential, such as identifying unique texture patterns in certain diseases.
There also appears to be a considerable association between the likelihood of a certain pixel pair recurring frequently (Maximum Probability) and the homogeneity of the image. This could mean that certain patterns or textures appear frequently and consistently across the image. For applications requiring a given texture pattern to occur frequently and uniformly, such as diagnosing diseases based on recurring patterns, this will be worthwhile.
A substantial positive correlation suggests that as the information content (Joint Entropy) of an image increases, the randomness in intensity differences (Difference Entropy) also increases; more information-rich images may show a wider range of intensity variations. Joint Entropy is likewise related to the distribution of pixel intensities (Maximum Probability): images with more varied pixel-intensity distributions tend to have higher entropy. In terms of deep learning, this is valuable for tasks where understanding both the randomness in intensity differences and the overall information content is decisive, e.g., tasks requiring diverse texture patterns.
- 2) Moderate Correlations
A moderate correlation between Joint Entropy and Contrast indicates that as a plant disease image's information content rises (greater Joint Entropy), the intensity difference between adjacent pixels (Contrast) tends to increase correspondingly. This further implies that images with higher entropy (more varied pixel-pair intensities) also tend to show more noticeable contrasts in intensity between neighboring pixels.
This correlation may be useful for activities where it is important to comprehend both the overall information content and the fluctuations in local intensity. For example, it could be helpful to capture different texture patterns at the global and local levels in disease identification.
A moderate link has been found between Difference Entropy and Contrast, indicating that a rise in the unpredictability of intensity differences between pixels (higher Difference Entropy) is accompanied by an increase in the intensity difference between a pixel and its neighbors (Contrast). According to this correlation, images with more diverse intensity differences between individual pixels also typically exhibit more pronounced intensity differences between neighboring pixels.
For machine learning and deep learning applications, this is relevant to tasks where both the local changes in intensity and the global randomness in intensity differences must be captured. In the context of plant disease image analysis, this correlation may aid the identification of texture patterns with distinctive local properties.
A moderately positive correlation between Correlation and Homogeneity indicates that stronger correlation tends to accompany higher homogeneity. In the context of machine/deep learning, this may suggest that more homogeneous textures show higher correlations between pixel values at various spatial distances, signifying a more predictable texture.
- 3) Weak Correlations
The low correlation seen between these metrics may suggest that the information contained in pixel differences (Difference Entropy) is not highly correlated with fluctuations in pixel differences across spatial distances (Difference Variance). In machine learning terms, this could imply that although pixel disparities vary, they may not significantly add to the image's information content.
- 4) Inverse Correlation
The negative correlation suggests that homogeneity and entropy are inversely proportional: homogeneity drops as Joint Entropy rises. This could imply that images with larger entropy tend to be less homogeneous, and, for machine learning, that images with a wider range of pixel intensities have less homogeneity.
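A correlation matrix like the one in Figure 10 can be computed directly from per-image metric values. Below is a small dependency-free Python sketch using Pearson's r; the feature values are toy numbers chosen only to mimic the relationships reported above (Energy and ASM moving together, Joint Entropy moving inversely), not measured data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_matrix(features):
    """Pairwise correlations; `features` maps metric name -> per-image values."""
    names = list(features)
    return {(a, b): pearson(features[a], features[b])
            for a in names for b in names}

# Toy per-image values (illustrative only).
feats = {"Energy":        [0.90, 0.80, 0.70, 0.60],
         "ASM":           [0.85, 0.75, 0.68, 0.55],
         "Joint_Entropy": [1.20, 1.90, 2.40, 3.10]}
cm = corr_matrix(feats)
```

Plotting `cm` as a heatmap, with dark cells for values near +1 or -1, reproduces the style of Figure 10.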
- 5) Kruskal-Wallis Test of Variance
To determine whether there are statistically significant differences between the GLCM metrics, the Kruskal-Wallis test for non-parametric data was used. This is because the datasets violate the ANOVA assumptions, including normality, as the presented distributions of the parameters across the datasets show. Statistical significance is indicated by a p-value less than 0.0001; given the size of the datasets, the p-value is approximate rather than exact, and the multiple stars signify this level of significance. There is indeed a significant difference (P < 0.05) in the medians between the 10 GLCM metrics, with a Kruskal-Wallis statistic of 619,192.
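The Kruskal-Wallis H statistic itself is straightforward to compute from ranks. The following pure-Python sketch is illustrative rather than the statistical package used in the study; it assigns mid-ranks to ties but omits the tie-correction divisor, so values on heavily tied data are approximate.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (mid-ranks for ties, no tie correction)."""
    data = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(data)
    ranks = [0.0] * n
    i = 0
    while i < n:                  # assign average ranks to tied runs
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        avg = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    rank_sums = [0.0] * len(groups)
    for idx, (_, gi) in enumerate(data):
        rank_sums[gi] += ranks[idx]
    return (12 / (n * (n + 1))
            * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
            - 3 * (n + 1))

print(round(kruskal_h([1, 2, 3], [4, 5, 6]), 4))  # 3.8571
```

Applied to the ten per-metric value lists, the resulting H is compared against a chi-squared distribution with (number of groups - 1) degrees of freedom to obtain the p-value.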
In light of the findings summarized in Table 4, machine/deep learning applications could be affected by the significant differences in GLCM metrics across plant disease datasets, since the relevance of these metrics as features might be decisive in distinguishing plant diseases. Integrating these metrics through advanced modeling could significantly impact classification accuracy. Likewise, tailoring deep learning or machine learning algorithms to accommodate these differences has the potential to enhance their performance: such algorithms could adapt their weights or learning rates to the dataset-specific characteristics revealed by its GLCM metrics. Nevertheless, it will be crucial to understand which metrics vary significantly, alongside the individual correlations between them. This will help enable targeted training, through fine-tuning or separate training for each dataset, hence optimizing their analytical power for specific diseases.
Table 4.
Kruskal-Wallis test.
Table Analyzed | GLCM 10 parameters
Kruskal-Wallis test |
P value | <0.0001
Exact or approximate P value? | Approximate
P value summary | ****
Do the medians vary signif. (P < 0.05)? | Yes
Number of groups | 10
Kruskal-Wallis statistic | 619192
Data summary |
Number of treatments (columns) | 10
Number of values (total) | 637010
Furthermore, metrics showing substantial unevenness or variation might correlate with disease severity; if properly leveraged, such image features could help assess the severity of plant diseases accurately. This could enhance the robustness of future machine learning and deep learning models, enabling them to better handle the diverse conditions and variations observed across different diseases.
3.5. Deep Learning Model’s Development and Analysis
To maximize performance, the model was trained over a fixed number of epochs (10) for all datasets. The training dataset is traversed entirely in each epoch, and the model's parameters are updated using mini-batches of data in each iteration. To guarantee the best results, hyperparameters including the learning rate, batch size, and regularization strategies were carefully adjusted (Figure 12).
As is well known, the input data size significantly affects model performance. Like other hyperparameters, the number of inputs should be the same for all datasets to allow a fair evaluation. For this purpose, after all datasets were converted into separate image datastores with MATLAB code, the class with the lowest number of samples was determined. The subclass labeled "Leamon (P10)_deseased" of dataset D2 has the fewest image samples (77 images). Using this threshold, 77 images were randomly selected from the subclasses of each dataset to obtain the final image datastores. Then, 80% of the data was randomly allocated for training and the remaining 20% for testing. The stages of preparing the datasets are shown in Figure 11. Results obtained from the deep learning training are presented in Table 5: the average accuracy, precision, recall, and F1-score measures show differences in the deep learning models' performance across the five datasets (D1-D5).
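The balanced-subsampling and 80/20 split described above can be sketched as follows. This is an illustrative Python version of the procedure (the study used MATLAB image datastores); the paths and class names are hypothetical.

```python
import random

def balanced_split(class_images, n_per_class=77, train_frac=0.8, seed=0):
    """Subsample every class to the size of the smallest one and split 80/20.

    `class_images` maps class name -> list of image paths (hypothetical).
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls, imgs in class_images.items():
        sample = rng.sample(imgs, n_per_class)
        cut = int(train_frac * n_per_class)   # 61 train, 16 test per class
        train += [(p, cls) for p in sample[:cut]]
        test += [(p, cls) for p in sample[cut:]]
    return train, test

# Three hypothetical classes of 120 images each.
data = {f"class_{k}": [f"class_{k}/img_{i}.jpg" for i in range(120)]
        for k in range(3)}
train, test = balanced_split(data)
print(len(train), len(test))  # 183 48
```

Equalizing class sizes in this way removes the imbalance effects discussed in Section 3.1.1 before training begins.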
Figure 11.
The stages of preparing final datasets with equal class-image numbers.
Figure 12.
Training Phase for D1.
Trained on the D1 and D2 datasets, the deep learning model demonstrated highly encouraging results on several assessment criteria, with average testing accuracies of 91.22% and 90.60%, respectively. This suggests strong agreement between the models' predictions and the ground-truth labels for the test samples. The models' capacity to properly classify diseased and healthy plant image samples while limiting false positives and false negatives is demonstrated by the average precision, recall, and F1-scores. The high values of these measures show that the models found an effective compromise between accurately recognizing unhealthy plants (recall) and minimizing misclassifications (precision). Generally, when working with imbalanced datasets, the F1-score is very helpful, as it provides a thorough assessment of the model's overall performance by taking both precision and recall into account [21,44].
Table 5.
Testing Performance Metrics of the datasets in the Deep learning.
Datasets | Av. accuracy | Av. precision | Av. recall | Av. F1-score
D1 | 0.9122 | 0.9141 | 0.9123 | 0.9111
D2 | 0.9060 | 0.9116 | 0.9061 | 0.9056
D3 | 0.6666 | 0.7329 | 0.6667 | 0.6411
D4 | 0.5866 | 0.5885 | 0.5867 | 0.5867
D5 | 0.5897 | 0.5996 | 0.5897 | 0.5852
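For reference, macro-averaged precision, recall, and F1 of the kind reported in Table 5 can be derived from per-image predictions as sketched below (illustrative Python; the class names are hypothetical).

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and q == c for t, q in zip(y_true, y_pred))
        fp = sum(t != c and q == c for t, q in zip(y_true, y_pred))
        fn = sum(t == c and q != c for t, q in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

# Hypothetical labels for six test images.
y_true = ["healthy", "healthy", "rust", "rust", "rust", "mite"]
y_pred = ["healthy", "rust", "rust", "rust", "mite", "mite"]
prec, rec, f1 = macro_scores(y_true, y_pred)
print(round(prec, 4), round(rec, 4), round(f1, 4))  # 0.7222 0.7222 0.6667
```

Macro averaging weights every class equally, which is why it is informative for the imbalanced datasets discussed here.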
As evidenced by the confusion matrix derived for D1 (Figure 13), 100% accuracy was recorded for 15 of the 38 classes. The lowest result was 9 accurate predictions, indicating the suitability of both the dataset and the model utilized (darkNet19), since errors were minimal. The higher average accuracy of D1 and D2 relative to D3, D4, and D5 implies that the models trained on D1 and D2 performed better overall at accurately classifying disease classes.
Figure 13.
Confusion Matrix of D1.
The models trained on the D3, D4, and D5 datasets have difficulty with classification tasks, as evidenced by their decreased accuracy scores. This could be due to inadequate representation of disease classes or to noise.
Additionally, D1 and D2 have greater average precision values than D3, D4, and D5, suggesting that fewer false positives (misclassifications) occurred in the models trained on these datasets. As evidenced by the confusion matrix derived for D2 (Figure 14), 100% accuracy was noted for 7 of the 22 classes; the lowest result was 11 accurate predictions, leaving 4 classes inaccurate, indicating the suitability and quality of the dataset for plant disease detection. The reduced precision values for D3, D4, and D5 point to an increased likelihood of false positives in the predictions, which caused many of the disease classes to be incorrectly identified. The confusion matrices (Figure 15, Figure 16 and Figure 17) indicated high accuracy in classifying the "diseased" classes but low performance on the "healthy" class. This makes clear that the number of classes in a dataset significantly affects its accuracy in deep learning applications.
Figure 14.
Confusion Matrix for D2.
Figure 15.
Confusion Matrix for D3.
Figure 16.
Confusion Matrix D4.
Figure 17.
Confusion Matrix D5.
Models trained on D1 and D2 also show greater average recall values than those trained on D3, D4, and D5, indicating fewer false negatives (missed detections). The lower recall scores for D3, D4, and D5 indicate a higher proportion of false negatives, implying that the models may have failed to identify instances of disease classes in those datasets.
The average F1-scores, which incorporate recall and precision, are greater for D1 and D2, showing that the models trained on these datasets achieve a superior balance between the two. The F1-score in particular indicates a good trade-off between precision and recall, which is considered essential for accurate disease detection [44]. The class imbalances across D3, D4, and D5 may be to blame for their lower F1-scores, which reflect a less than ideal balance between precision and recall.