1. Introduction
Pepper holds a prominent position in global agriculture, with China standing as the largest producer and consumer, accounting for 37% of the total pepper planting area worldwide. Although pepper is an indispensable vegetable in daily life, pepper plants are highly vulnerable to diseases, particularly diseases that affect the leaves [1]. When the timely detection and control of pepper leaf diseases are neglected, the health and productivity of pepper plants are significantly impacted, resulting in substantial economic losses [2]. The primary approach to combating these diseases is the application of pesticides [3]. However, conventional methods of agricultural chemical application often overlook the severity of the disease, leading to inconsistent dosages across different areas, with some areas receiving insufficient treatment while others experience excessive chemical exposure [4,5]. Moreover, the incidence of pepper leaf diseases has shown a consistent increase over time. Inaccurate application of agricultural chemicals not only contributes to environmental pollution but also hampers the effectiveness of disease treatment [6]. In practice, planters typically resort to manual identification of disease spots and manual assessment of disease severity; this process is labor-intensive and subjective, introducing a risk of misinterpretation [7]. In view of this, numerous researchers have proposed intelligent recognition methods for plant diseases, using CNN models to classify plant images with good recognition results [8,9,10,11,12,13]. These plant images are categorized by background into two types: natural background images and pure background images. Natural background images are obtained under plants' natural growth conditions, which often results in complex backgrounds with high levels of uncertainty. In contrast, pure background images are captured against a steady backdrop, such as a tabletop or the ground, facilitating easier and more reliable disease identification. However, the backgrounds of pure background images are usually highly specific and difficult to replicate in the field, posing a significant challenge to the robustness of recognition models. Therefore, accurate extraction of diseased leaves plays a crucial role in intelligent plant disease recognition and can remarkably improve the robustness of disease recognition models. Among various techniques, image segmentation methods provide a direct and effective means of extracting pepper leaves, thereby laying the foundation for monitoring and diagnosing areas affected by pepper leaf diseases.
Advances in imaging technology have greatly facilitated the use of image segmentation techniques for plant leaf extraction, enabling agricultural experts to analyze plant growth based on various leaf image features [14,15,16]. Numerous approaches have been proposed for the segmentation of plant leaf images. Threshold-based methods, such as fuzzy C-means algorithms [17], have been used to iteratively determine the optimal threshold for leaf image segmentation. Histogram intensity-based threshold methods [18,19], employing histogram bimodal and Otsu algorithms, have also been utilized for segmenting leaf images. However, these threshold-based approaches face challenges when dealing with complex images. Region-based methods, such as the region-based level set method, the region growing method [20], and the wavelet method [21], have demonstrated high accuracy and fast runtimes for plant leaf segmentation. Although these methods have achieved satisfactory results to some extent, their effectiveness relies heavily on image features, limiting their widespread applicability. Clustering-based methods, utilizing fuzzy k-means clustering [22], have been employed to compute clustering centers for leaf segmentation. However, these methods may struggle to escape local optima, resulting in lower segmentation accuracy.
Deep learning-based technologies have made remarkable strides in the field of computer vision, bringing about significant advances in agricultural applications [23]. Notably, convolutional neural networks [24] and the U-Net architecture [25] have gained widespread application. Subsequently, U-Net variants that fuse different mechanisms have been presented to improve segmentation, such as attention U-Net [26], UNet++ [27], and the pyramid attention network [28]. Transformer-based networks (Dosovitskiy et al., 2020) utilize the self-attention mechanism to build long-range dependencies and can obtain competitive results in image recognition. However, transformer-based models [29,30,31,32] have mainly focused on improving the extraction of global context information while ignoring detailed local information. MLP-Mixer [33] showed that pure MLP-based networks can achieve competitive performance in vision tasks, since the MLP can replace the self-attention mechanism to some extent. RepMLPNet [34] used locality injection to merge local priors into the fully connected layer, which provides a way to extract detailed information.
Inspired by the ability of MLP-based models to replace the self-attention mechanism, we propose a novel Adaptive Multi-Scale MLP (AMS-MLP) network for pepper leaf segmentation. The AMS-MLP network follows an encoder-decoder architecture and combines a Multi-Path Aggregation Mask (MPAM) module with a Multi-Scale Context Relation Decoder (MCRD) module. Moreover, to facilitate the fusion of global and local information between the encoder and decoder, an Adaptive Multi-Scale Global and Local MLP (AMSGL-MLP) module is designed to replace each skip connection layer. In the AMS-MLP network, the AMSGL-MLP module overcomes the limitations imposed by the inductive bias of convolutional layers, capturing global information while progressively fusing discrete local details. Additionally, the MCRD module enables our model to emphasize the boundary relationship between the foreground and the background, especially at segmented edges. Our contributions are as follows:
1) We present a novel segmentation framework designed specifically for accurate pepper leaf segmentation against diverse backgrounds. The framework effectively extracts pepper leaves from images with various pure backgrounds. Notably, unlike previous frameworks, features aggregated from five encoder layers are utilized to generate a single-channel mask for refining the boundaries of the segmented pepper leaves.
2) We propose the AMSGL-MLP module, which replaces the self-attention mechanism, for the automatic extraction of multi-scale features. The AMSGL-MLP module employs two branches, a Global Multi-Scale MLP (GMS-MLP) branch and a Local Multi-Scale MLP (LMS-MLP) branch, to extract global and local feature maps. By utilizing an attention mechanism, the module dynamically adjusts the weights assigned to the global and local features, facilitating the effective fusion of global and local information (a minimal conceptual sketch of this fusion is given after this list).
3) We propose the MCRD module, which combines adjacent-scale features using an attention mechanism to generate enhanced boundary features and contextual information for the segmented target.
4) Extensive experiments conducted on the pepper leaf dataset demonstrate the superiority of the proposed model over state-of-the-art (SOTA) methods.
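To make contribution 2 more concrete, the following is a minimal, hypothetical PyTorch sketch of the adaptive global/local fusion idea: two branch outputs (simple stand-ins for the GMS-MLP and LMS-MLP branches detailed in Section 3) are re-weighted by learned attention coefficients and fused. All class and layer names here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AdaptiveGlobalLocalFusion(nn.Module):
    """Hypothetical sketch: fuse a global branch and a local branch with
    adaptive attention weights (illustrative stand-in, not the paper's code)."""
    def __init__(self, channels: int):
        super().__init__()
        # placeholders for the GMS-MLP (global) and LMS-MLP (local) branches
        self.global_branch = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.GELU())
        self.local_branch = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
        # channel-pooled attention producing one weight per branch
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.global_branch(x)                 # global feature map
        l = self.local_branch(x)                  # local feature map
        w = self.attn(torch.cat([g, l], dim=1))   # adaptive weights, shape (B, 2, 1, 1)
        return w[:, 0:1] * g + w[:, 1:2] * l      # weighted fusion of global and local features

x = torch.randn(1, 64, 32, 32)
print(AdaptiveGlobalLocalFusion(64)(x).shape)     # torch.Size([1, 64, 32, 32])
```

The key design choice illustrated here is that the fusion weights are predicted from the branch outputs themselves rather than being fixed, so the network can emphasize global or local evidence on a per-image basis.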
The remainder of the paper is organized as follows. Section 2 provides an overview of related work, encompassing CNN-based and MLP-based models in semantic segmentation. Section 3 details the specific network architecture employed in our approach. Section 4 describes the experimental settings and presents the obtained results. Finally, Section 5 presents the conclusions drawn from this research.
5. Conclusion
The robustness of current plant disease recognition models is poor due to the diversity of plant leaf image backgrounds. Therefore, accurate extraction of plant leaves from the background is highly desirable for plant disease recognition. In this paper, we propose a lightweight and accurate leaf segmentation model for extracting pepper leaves from diverse backgrounds. Specifically, we design an adaptive multi-scale MLP network for pepper leaf segmentation by combining the MPAM module and the MCRD module. It consists of an encoder network, an AM-MLP module, and a decoder network. Within the encoder network, the MPAM module is employed to aggregate features from five layers and generate a single-channel mask, enhancing the accuracy of pepper leaf boundary extraction. In the AM-MLP module, the GMS-MLP branch extracts global features while the LMS-MLP branch focuses on capturing local feature maps; an adaptive attention module then dynamically weights and fuses the features of the global and local branches. The decoder network incorporates the MCRD module into the convolutional layers, which enhances boundary extraction. To validate the generalizability of our approach, extensive experiments are conducted on three pepper leaf datasets, yielding mIoU scores of 97.39%, 96.91%, and 97.91%, and F1-scores of 98.29%, 97.86%, and 98.51%, respectively. Meanwhile, ablation experiments are conducted by gradually integrating the three modules, namely AM-MLP, MPAM, and MCRD, into the baseline model; the results in Table 5 demonstrate significant improvements in segmentation performance across six evaluation metrics.
Although the proposed AMS-MLP network is able to segment pepper leaves effectively, the method is fully supervised and requires a large number of labeled training samples. In future work, we plan to explore weakly supervised or self-supervised methods for pepper leaf segmentation. We also plan to investigate fine-tuning of existing deep learning models to enhance their generalization ability and provide more effective and feasible solutions for pepper leaf images in different scenarios.
Figure 1.
Sample images with different pure backgrounds.
Figure 2.
Overview of the AMS-MLP framework, including the encoder network, the adaptive multi-scale MLP (AM-MLP) module, and the decoder network. The encoder network comprises five convolutional layers incorporating four downsampling operations and a multi-path aggregation mask (MPAM) module. The decoder network comprises five convolutional layers incorporating four upsampling layers and three MCRD modules. The AM-MLP module is used in the skip connection layers.
Figure 3.
The network architecture of the AM-MLP module. The input feature map F is split into the global multi-scale MLP (GMS-MLP) branch FG and the local multi-scale MLP (LMS-MLP) branch FL. After each branch passes through multiple cascaded MLP blocks, the resulting features are alternately multiplied to enhance information interaction and then added together. Finally, multi-scale features and local information are automatically extracted using an adaptive attention mechanism.
Figure 4.
Illustration of the GMS-MLP and LMS-MLP modules. As an example, we use an input of size W = 16 and H = 16, where B is the batch size and C is the channel number. The input feature is processed by the GMS-MLP and LMS-MLP branches. In the GMS-MLP branch, the feature map is initially divided into non-overlapping patches of size 2 × 2, resulting in a grid of size 8 × 8. These patches are then flattened and fed into a fully connected (FC) layer along the first axis. Finally, the output is reshaped and ungridded to restore the original size. In the LMS-MLP branch, the feature map is divided into non-overlapping patches of size 8 × 8, resulting in a blocking of size 2 × 2. These patches are flattened and processed through an FC layer along the second axis. The output is then reshaped and unblocked to regain the original size.
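The gridding and blocking operations described in Figure 4 can be read as follows; this is a minimal sketch of one plausible interpretation, using the 16 × 16 example, where the helper names, FC placement, and layer sizes are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

def to_patches(x: torch.Tensor, p: int) -> torch.Tensor:
    """(B, C, H, W) -> (B, C, num_patches, p*p): one row per p x p patch."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 1, 2, 4, 3, 5).reshape(B, C, (H // p) * (W // p), p * p)

def from_patches(x: torch.Tensor, p: int, H: int, W: int) -> torch.Tensor:
    """Inverse of to_patches: (B, C, num_patches, p*p) -> (B, C, H, W)."""
    B, C = x.shape[:2]
    x = x.reshape(B, C, H // p, W // p, p, p)
    return x.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)

# Hypothetical 16 x 16 input, mirroring the example in Figure 4.
B, C, H, W = 1, 8, 16, 16
x = torch.randn(B, C, H, W)

# GMS-MLP-style mixing: 2 x 2 patches give an 8 x 8 grid (64 patches);
# the FC layer mixes across patches, i.e. long-range (global) interaction.
g = to_patches(x, p=2)                                        # (B, C, 64, 4)
g = nn.Linear(64, 64)(g.transpose(-1, -2)).transpose(-1, -2)  # FC over the patch axis
x_global = from_patches(g, 2, H, W)

# LMS-MLP-style mixing: 8 x 8 patches give a 2 x 2 blocking (4 patches);
# the FC layer mixes inside each patch, i.e. local interaction.
l = to_patches(x, p=8)                                        # (B, C, 4, 64)
l = nn.Linear(64, 64)(l)                                      # FC over the within-patch axis
x_local = from_patches(l, 8, H, W)

print(x_global.shape, x_local.shape)  # both torch.Size([1, 8, 16, 16])
```

Under this reading, the two branches differ only in the partition size and the axis along which the FC layer mixes, giving complementary global and local receptive behavior at the same cost.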
Figure 5.
Illustration of the multi-scale context relation decoder (MCRD) module. Two feature maps from adjacent scales are input into the MCRD module; the high-level features are first upsampled. The generated feature maps then pass through the sigmoid activation function and a convolutional operation, which generates mask maps representing the foreground and background regions.
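A minimal, hypothetical sketch of this decoding step is given below: the coarse feature map is upsampled, converted into complementary foreground/background masks, and the masks re-weight the finer feature map. Layer sizes, the exact ordering of the sigmoid and convolution, and all names are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCRDSketch(nn.Module):
    """Illustrative sketch of the decoding step in Figure 5: the coarser feature
    map is upsampled, turned into complementary foreground/background masks,
    and the masks re-weight the finer feature map (names/sizes are assumptions)."""
    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        self.to_mask = nn.Conv2d(high_ch, 1, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(low_ch * 2 + high_ch, low_ch, kernel_size=3, padding=1)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        # upsample the high-level (coarse) features to the low-level resolution
        f_high = F.interpolate(f_high, size=f_low.shape[-2:], mode='bilinear',
                               align_corners=False)
        fg = torch.sigmoid(self.to_mask(f_high))   # foreground mask
        bg = 1.0 - fg                               # complementary background mask
        # mix foreground- and background-weighted features to emphasize the boundary
        return self.fuse(torch.cat([f_low * fg, f_low * bg, f_high], dim=1))

f_low, f_high = torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)
print(MCRDSketch(128, 64)(f_low, f_high).shape)  # torch.Size([1, 64, 32, 32])
```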
Figure 6.
Illustration of the multi-path mask decoder module. From the fifth to the second layer, the feature maps in the encoder first pass through a convolutional operation that suppresses the channel number, so that the output features have the same number of channels as the first layer of the encoder.
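As a rough illustration of this aggregation path, the sketch below channel-reduces the deeper encoder features, upsamples them to the first layer's resolution, and projects the concatenation to a single-channel mask (as described for the MPAM module in the text). The channel counts, the use of 1 × 1 convolutions, and bilinear upsampling are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPAMSketch(nn.Module):
    """Illustrative sketch of the aggregation path in Figure 6: deeper encoder
    features are channel-reduced, upsampled to the first layer's resolution,
    concatenated, and projected to a single-channel mask. Channel counts and
    layer choices are assumptions, not the paper's exact configuration."""
    def __init__(self, enc_channels=(16, 32, 64, 128, 256)):
        super().__init__()
        base = enc_channels[0]
        # 1 x 1 convolutions that suppress layers 2-5 to the first layer's channel count
        self.reduce = nn.ModuleList([nn.Conv2d(c, base, kernel_size=1)
                                     for c in enc_channels[1:]])
        self.to_mask = nn.Conv2d(base * len(enc_channels), 1, kernel_size=1)

    def forward(self, feats):
        f1 = feats[0]
        aligned = [f1]
        for conv, f in zip(self.reduce, feats[1:]):
            f = conv(f)                                                 # suppress channels
            f = F.interpolate(f, size=f1.shape[-2:], mode='bilinear',
                              align_corners=False)                      # match resolution
            aligned.append(f)
        return torch.sigmoid(self.to_mask(torch.cat(aligned, dim=1)))   # single-channel mask

feats = [torch.randn(1, c, 256 // 2 ** i, 256 // 2 ** i)
         for i, c in enumerate((16, 32, 64, 128, 256))]
print(MPAMSketch()(feats).shape)  # torch.Size([1, 1, 256, 256])
```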
Figure 7.
Qualitative comparison of the proposed model with six other models on the EBD dataset; five examples of the predicted results are shown. From the 1st to the 9th column: the original image, the predicted results of FCN-VGG16, U-Net, AttUNet, UNet++, UNeXt, CM-MLP, and our model, and the ground truth, respectively.
Figure 8.
Qualitative comparison of the proposed model with six other models on the BSD dataset; five examples of the predicted results are shown. From the 1st to the 9th column: the original image, the predicted results of FCN-VGG16, U-Net, AttUNet, UNet++, UNeXt, CM-MLP, and our model, and the ground truth, respectively.
Figure 9.
Qualitative comparison of the proposed model with six other models on the MLD dataset; five examples of the predicted results are shown. From the 1st to the 9th column: the original image, the predicted results of FCN-VGG16, U-Net, attention U-Net (AttUNet), UNet++, UNeXt, CM-MLP, and our model, and the ground truth, respectively.
Table 1.
The distribution of the three image datasets.
Dataset | Test | Training | Validation | Total
Early Blight Dataset (EBD) | 163 | 865 | 162 | 1190
Brown Spot Dataset (BSD) | 186 | 1015 | 184 | 1385
Mixed Leaf Dataset (MLD) | 1323 | 4629 | 661 | 6613
Table 2.
The results of segmenting the EBD dataset using seven different models.
Group | Model | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%) | mIoU (%) | F1-score (%)
FCN | FCN-16s | 99.45 | 97.33 | 99.79 | 98.67 | 97.04 | 98.00
UNet-based | U-Net | 99.53 | 97.31 | 99.85 | 98.92 | 87.47 | 98.11
UNet-based | AttU-Net | 99.26 | 96.29 | 99.74 | 98.38 | 96.06 | 97.33
UNet-based | UNet++ | 99.43 | 97.04 | 99.82 | 98.87 | 96.95 | 97.94
MLP-based | UNeXt | 99.31 | 96.31 | 99.79 | 98.67 | 96.38 | 97.48
MLP-based | CM-MLP | 99.44 | 97.41 | 99.77 | 98.54 | 96.96 | 97.97
MLP-based | Ours | 99.53 | 97.61 | 99.84 | 98.97 | 97.39 | 98.29
Table 3.
The results of segmenting the BSD dataset using seven different models.
Group | Model | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%) | mIoU (%) | F1-score (%)
FCN | FCN-16s | 99.69 | 97.17 | 99.87 | 98.11 | 96.62 | 97.64
UNet-based | U-Net | 98.83 | 93.85 | 99.18 | 88.95 | 96.14 | 91.33
UNet-based | AttU-Net | 99.62 | 98.05 | 99.73 | 96.26 | 95.97 | 97.14
UNet-based | UNet++ | 99.37 | 98.08 | 99.46 | 92.68 | 95.75 | 95.31
MLP-based | UNeXt | 99.37 | 97.70 | 99.49 | 93.06 | 94.66 | 95.33
MLP-based | CM-MLP | 99.66 | 96.95 | 99.85 | 97.83 | 95.68 | 97.39
MLP-based | Ours | 99.72 | 97.47 | 99.88 | 98.26 | 96.91 | 97.86
Table 4.
The results of segmenting the MLD dataset using seven different models.
Group | Model | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%) | mIoU (%) | F1-score (%)
FCN | FCN-16s | 99.61 | 96.93 | 99.89 | 98.94 | 97.10 | 97.92
UNet-based | U-Net | 99.46 | 95.40 | 99.88 | 98.80 | 96.19 | 97.07
UNet-based | AttU-Net | 99.57 | 96.43 | 99.90 | 99.03 | 97.05 | 97.71
UNet-based | UNet++ | 98.87 | 89.58 | 99.83 | 98.25 | 92.15 | 93.71
MLP-based | UNeXt | 99.20 | 92.84 | 99.86 | 98.56 | 94.24 | 95.61
MLP-based | CM-MLP | 99.71 | 98.02 | 99.88 | 98.85 | 97.32 | 98.44
MLP-based | Ours | 99.72 | 97.79 | 99.92 | 99.24 | 97.91 | 98.51
Table 5.
Comparison results for the ablation experiment on pepper leaf segmentation.
Model | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%) | mIoU (%) | F1-score (%)
Baseline (BU-Net) | 97.65 | 93.92 | 79.11 | 97.96 | 97.74 | 85.88
BAM-MLP (BU-Net+AM-MLP) | 98.37 | 96.70 | 84.20 | 98.51 | 98.06 | 90.02
BMAM-MLP (BU-Net+MPAM+AM-MLP) | 99.04 | 96.78 | 91.36 | 99.25 | 96.88 | 93.87
Ours (BU-Net+MPAM+AM-MLP+MCRD) | 99.63 | 97.98 | 97.20 | 99.77 | 98.28 | 97.59
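For reference, the six metrics reported in Tables 2-5 can be computed from a binary confusion matrix as sketched below. This follows the standard definitions, with mIoU averaged over the leaf and background classes, and is not necessarily the authors' exact evaluation script.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard binary-segmentation metrics (the six columns of Tables 2-5).
    `pred` and `gt` are boolean masks; reference sketch only."""
    tp = np.logical_and(pred, gt).sum()      # leaf pixels correctly predicted
    tn = np.logical_and(~pred, ~gt).sum()    # background pixels correctly predicted
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    iou_fg      = tp / (tp + fp + fn)
    iou_bg      = tn / (tn + fp + fn)
    miou        = (iou_fg + iou_bg) / 2      # mean IoU over leaf and background
    f1          = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, recall=recall, specificity=specificity,
                precision=precision, miou=miou, f1=f1)

pred = np.random.rand(256, 256) > 0.5
gt = np.random.rand(256, 256) > 0.5
print(segmentation_metrics(pred, gt))
```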