5.2.1. Convolutional Neural Networks (CNN)
Convolutional neural networks (CNNs) have been used extensively to analyze images for precision agriculture. In particular, transfer learning with a variety of pre-trained models, including Inception V3 and VGG, has often been applied successfully. For example, Crimaldi et al. [60] used an Inception V3 model and achieved 78.1% accuracy in classifying a crop into one of 14 crop types on a dataset of 54,309 images. Milioto et al. [61] built a CNN model using RGB and NIR camera images. The model achieved 97.3% accuracy for images of early crop growth and 89.2% accuracy for images of crops in later stages, while recall remained high for both, at 98% for the early stage and 99% for the later stage. Similarly, Bah et al. [62] used the AlexNet model on spinach, beet, and bean datasets and obtained precisions of 93%, 81%, and 69%, respectively. The authors attributed the weaker results primarily to leaves overlapping between crops and weeds. Reddy et al. [
63] used a customized CNN model for plant species identification and achieved 99.5% precision on the Flavia, Swedish leaf, and UCI leaf datasets. Sembiring et al. [64] focused on tomato plant disease detection. Their proposed model achieved 97.15% validation accuracy on the tomato leaf dataset from PlantVillage. However, it was not the highest-scoring of the four models trained; the highest validation accuracy of 98.28% was achieved by the VGG16 model. Geetharamani et al. [
65] achieved a classification accuracy of 96.46% using a customized nine-layer CNN model. R. et al. [66] used a residual learning CNN with an attention mechanism to perform real-time corn leaf disease recognition, also using the PlantVillage Disease Classification challenge dataset, and achieved an overall accuracy of 98%. Nanni et al. [67] used different ensembles of CNNs, including ResNet50, GoogleNet, ShuffleNet, MobileNetv2, and DenseNet201, with different variants of the Adam optimizer. These models were trained on three datasets of insect images: the Deng dataset, the IP102 dataset, and the Xie2 dataset. The best-performing ensemble achieved state-of-the-art accuracy, scoring 95.52% on Deng, a result competitive with human expert classification, and 73.46% on IP102.
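A recurring recipe in these studies is transfer learning: an ImageNet pre-trained backbone is loaded, its classification head is replaced with one sized for the crop or disease classes, and only the new head (or the last few layers) is fine-tuned. The sketch below illustrates this generic pattern with PyTorch/torchvision; the ResNet50 backbone, the 14-class output, and the data folder are illustrative assumptions rather than details taken from any of the cited papers.

```python
# Minimal transfer-learning sketch (PyTorch/torchvision). The backbone choice,
# class count, and dataset path are illustrative placeholders, not values
# reproduced from the cited studies.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 14  # e.g., one class per crop type

# Load an ImageNet-pretrained backbone and replace its classification head.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Freeze the backbone so that only the new head is trained.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
dataset = datasets.ImageFolder("data/crops", transform=preprocess)  # hypothetical folder
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```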
Atila et al. [
68] proposed using the EfficientNet architecture for plant disease classification on the PlantVillage dataset and achieved 99.91% and 99.97% accuracy on the original and augmented datasets, respectively. Prasad et al. [69] proposed a two-step machine learning approach that analyzed low-fidelity and high-fidelity images from drones in sequence, preserving efficiency as well as the accuracy of plant diagnosis. The pathology 2020 dataset and a set of synthetically generated images were used. A semi-supervised model derived from EfficientNet, called EfficientDet, was used to perform segmentation and classification, and the identifier model achieved an average accuracy of 75.5%. Albattah et al. [70] proposed a customized model using an EfficientNetV2-B4 backbone to address plant disease classification. The PlantVillage dataset and additional UAV images were used to train the model. The results were 99.63%, 99.93%, 99.99%, and 99.78% for precision, recall, accuracy, and F1-score, respectively.
Mishra et al. [
71] developed a standard CNN model to detect corn plant diseases in real time. The model was deployed on an Intel Movidius NCS and a Raspberry Pi 3B+ module. The authors used the PlantVillage Disease Classification challenge dataset and divided the images into three classes: rust, northern leaf blight, and healthy. The system achieved an accuracy of 98.40% using a GPU and 88.56% on the NCS chip. Bah et al. [72] used unsupervised data labeling for weed detection from UAV images. The dataset covered two fields, beans and spinach, and each dataset was divided into two classes: crop and weed. Two-thirds of the data was labelled in a supervised manner, while the remaining third was labelled using unsupervised methods. The ResNet-18 model was used to perform the classification. ResNet-18 significantly outperformed SVM and RF methods in the bean field, achieving an average AUC of 91.7% on both supervised and unsupervised labelled data, compared with 52.68% using SVM and 66.7% using RF. In the spinach field, on the other hand, RF achieved a slightly higher average AUC than ResNet-18.
Zheng et al. [
73] proposed multiple CNN models to estimate percent canopy cover as well as vineyard leaf area index in each field. The authors compared the estimation performance of five different models: a CNN-ConvLSTM model, a Vision Transformer model, a Joint model, an Xception model, and a ResNet-50 model. The five models were trained on a dataset containing approximately 840 images extracted from UAV videos of vineyard fields at Alcorn State University and were evaluated using the RMSE of both leaf area index (LAI) and percent canopy cover. For LAI prediction, Xception, CNN-ConvLSTM, Vision Transformer, ResNet-50, and the Joint model had RMSEs of 0.28, 0.32, 0.34, 0.41, and 0.43, respectively. For percent canopy cover prediction, the corresponding RMSEs were 4.01, 4.50, 4.56, 5.98, and 6.08. Xception therefore performed best for both LAI and percent canopy cover estimation.
Yang et al. [
74] proposed a method of multi-source data fusion for disease and pest detection on grape foliage using the ShuffleNet V2 model. The dataset consisted of 834 groups of grape foliage images, each containing three types of image: an RGB image (RGBI) (2592 × 1944, 3 channels), a multispectral image (MSI) (409 × 216, 25 channels), and a thermal infrared image (TIRI) (640 × 512, 3 channels). The accuracy was 82.4% using MSI, 93.41% using RGB, and 68.26% using TIRI.
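One simple way to picture multi-source fusion is early fusion: the modalities are resampled to a common spatial grid and stacked channel-wise before entering a single network. The sketch below shows that generic pattern with random arrays shaped like the reported image sizes; it is only an assumption for illustration and not necessarily the fusion strategy used in [74].

```python
# Illustrative early-fusion sketch: resample RGB, multispectral, and thermal
# inputs to a shared grid and concatenate along the channel axis. Shapes mimic
# the reported image sizes; the fusion used in the cited work may differ.
import numpy as np
import torch
import torch.nn.functional as F

rgb  = np.random.rand(1944, 2592, 3).astype(np.float32)   # RGBI: 2592 x 1944, 3 channels
msi  = np.random.rand(216, 409, 25).astype(np.float32)    # MSI: 409 x 216, 25 channels
tiri = np.random.rand(512, 640, 3).astype(np.float32)     # TIRI: 640 x 512, 3 channels

def to_grid(img_hwc, size=(224, 224)):
    """Convert an H x W x C array to a 1 x C x size tensor on a common grid."""
    t = torch.from_numpy(img_hwc).permute(2, 0, 1).unsqueeze(0)
    return F.interpolate(t, size=size, mode="bilinear", align_corners=False)

fused = torch.cat([to_grid(rgb), to_grid(msi), to_grid(tiri)], dim=1)
print(fused.shape)  # torch.Size([1, 31, 224, 224]) -> 3 + 25 + 3 channels
```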
Briechle et al. [
75] used multispectral images to classify tree species and standing dead trees using the PointNet++ model. The data consisted of UAV-based light detection and ranging (LiDAR) data, including laser echo pulse width, and five-channel multispectral imagery. Segmentation was also applied to the images during data preprocessing. Their model achieved an accuracy of 90.2%.
Aiger et al. [
76] proposed a method of image classification based on multi-view image projections. Their method used projections of multiple images at multiple depth planes near the reconstructed surface. This enabled the classification of categories whose most noticeable aspect is appearance change under different viewpoints, such as water, trees, and other materials with complex reflection and light-response properties. They obtained a best accuracy of 96.3% with their proposed 3D CNN.
Table 5. Convolutional Neural Networks Summary.

Paper | CNN Model/Architecture | Strengths | Comments | Best Results
Crimaldi et al. [60] | Inception V3 | Identification time of 200 ms, which is good for real-time applications | Low accuracy | Accuracy of 78.1%
Milioto et al. [61] | CNN fed with RGB + NIR camera images | High accuracy for the early growth stage | Lower accuracy for the later growth stage | Early growth stage: accuracy 97.3%, recall 98%; later growth stage: accuracy 89.2%, recall 99%
Bah et al. [62] | AlexNet | Fewer images, with high resolution, from a drone | Overlapping of the leaves between crops and weeds | Best precision was for the spinach dataset, with 93%
Reddy et al. [63] | Customized CNN | High precision and recall | Large dataset | Precision of 99.5% for the Leafsnap dataset; the Flavia, Swedish leaf, and UCI leaf datasets had a recall of 98%
Sembiring et al. [64] | Customized CNN | Low training time compared to the other models compared in the paper | Not the highest-performing model compared in the paper | Accuracy of 97.15%
Geetharamani et al. [65] | Deep CNN | Can classify 38 distinct classes of healthy and diseased plants | Large dataset | Classification accuracy of 96.46%
R. et al. [66] | Residual learning CNN with attention mechanism | High accuracy with only 600k parameters, fewer than the other models compared in the paper | Large dataset | Overall accuracy of 98%
Nanni et al. [67] | Ensembles of CNNs based on different topologies (ResNet50, GoogleNet, ShuffleNet, MobileNetv2, and DenseNet201) | Using Adam helps decrease the learning rate of parameters whose gradients change more frequently | IP102 is a large dataset | 95.52% on Deng and 73.46% on IP102
Bah et al. [77] | CRowNet | Able to detect rows in images of several types of crops | Not a single CNN model | Accuracy: 93.58%; IoU: 70%
Atila et al. [68] | EfficientNet | Reduces the calculations by the square of the kernel size | Did not have the lowest training time compared to the other models in the paper | PlantVillage original dataset: accuracy 99.91%, precision 98.42%; augmented dataset: accuracy 99.97%, precision 99.39%
Prasad et al. [69] | EfficientDet | Scaling ability and FLOP reduction | Performs well for limited labelled datasets; however, the accuracy is still low | Identifier model average accuracy: 75.5%
Albattah et al. [70] | EfficientNetV2-B4 | Reliable results with low time complexity | Large dataset | Precision: 99.63%; recall: 99.93%; accuracy: 99.99%; F1-score: 99.78%
Mishra et al. [71] | Standard CNN | Can run on devices such as a Raspberry Pi, smartphones, or drones; works in real time with no internet | NCS recognition accuracy is limited and can be improved, according to the authors | Accuracy: 98.40% (GPU), 88.56% (NCS chip)
Bah et al. [72] | ResNet-18 | Outperformed SVM and RF methods and uses an unsupervised training dataset | ResNet-18 results are lower than SVM and RF in the spinach field | AUC of 91.7% on both supervised and unsupervised labelled data
Zheng et al. [73] | Multiple CNN models, including CNN-ConvLSTM, Vision Transformer, Joint, Xception, and ResNet-50 models | Compares multiple models | The Joint model had trouble with LAI estimation, and the Vision Transformer had trouble with percent canopy cover estimation | LAI RMSE: Xception 0.28, CNN-ConvLSTM 0.32, ResNet-50 0.41
Yang et al. [74] | ShuffleNet V2 | Total of 3.785 M parameters, which makes it portable and easy to apply | Not the fewest parameters among the models compared in the paper | Accuracy: MSI 82.4%, RGB 93.41%, TIRI 68.26%
Briechle et al. [75] | PointNet++ | Good score compared to the models mentioned in the paper | Not yet tested for practical use | Accuracy: 90.2%
Aiger et al. [76] | 3D CNN | Large-scale, robust, and high accuracy | Lower accuracy for the 2D CNN | Accuracy: 96.3%
5.2.2. U-Net Architecture
The U-Net architecture was originally introduced in the medical domain by Ronneberger et al. [
78] and is commonly used for image segmentation. U-Net follows an encoder-decoder architecture. Many factors, such as crop density, the flight height of the drone, and the growth stage, affect how well U-Net performs. According to Kitano et al. [79], U-Net did not perform well when plants were very close together; however, techniques such as applying the morphological opening operator can mitigate this problem.
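The opening operator mentioned above erodes and then dilates a binary segmentation mask, which removes thin bridges between plants that the network has merged. A minimal OpenCV sketch is shown below; the mask file name and kernel size are arbitrary illustrative choices rather than settings from [79].

```python
# Minimal sketch: morphological opening on a binary segmentation mask to
# separate touching plant blobs. File name and kernel size are illustrative.
import cv2

mask = cv2.imread("plant_mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical U-Net output
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erosion followed by dilation

# Counting connected components shows how merged plants get separated.
n_before, _ = cv2.connectedComponents(mask)
n_after, _ = cv2.connectedComponents(opened)
print(n_before - 1, "blobs before opening,", n_after - 1, "after")
```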
Lin et al. [
80] used U-Net to achieve an accuracy of 95.5% and an RMSE of 2.5% with 1000 manually labelled training images. Arun et al. [24] achieved an accuracy of 95.34% and an RMSE of 7.45 using a reduced U-Net that performs efficient pixel-wise segmentation of weeds and crops in agricultural field images. Hoummaidi et al. [81] used the U-Net model to perform vegetation extraction and achieved an overall accuracy of 89.7%, with palm trees and Ghaf trees reaching higher detection rates of 96.03% and 94.54%, respectively. The authors attributed the lower overall accuracy to trees being obstructed by other trees. Palm trees also caused some errors due to their physical characteristics and the small crown sizes of some trees; the authors suggested that including young palms in the training data could reduce the crown-size error rate. Doha et al. [82] used the U-Net architecture to detect crop rows by performing semantic segmentation on vertical aerial images. Zhang et al. [83] used the Dual-flow U-Net (DF-U-Net) to detect yellow rust severity in farmlands. The dataset was collected from the Yangling experimental field using a RedEdge camera on board a DJI M100 UAV with a sensor size of 1336 × 2991. The F1-score, accuracy, and precision were 94.13%, 96.93%, and 94.02%, respectively. A Sparse Channel Attention (SCA) module was designed to increase the receptive field of the network and improve the ability to distinguish each category. Using U-Net, Lin et al. [
80] achieved high accuracy with a small dataset. Similarly, with only 48 images, Tsuichihara et al. [
84] achieved an accuracy of about 80% for detecting broad-leaved weeds.
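For reference, the encoder-decoder-with-skip-connections structure that all of these studies build on can be written very compactly. The sketch below is a deliberately shallow toy U-Net with a single down/up level and arbitrary channel widths; it illustrates the architecture only and is not a reproduction of any cited model.

```python
# Toy U-Net: one encoder block, one bottleneck, one decoder block, and a skip
# connection. Real models in the cited studies are deeper and wider.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 32)           # encoder block
        self.down = nn.MaxPool2d(2)
        self.bottleneck = double_conv(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = double_conv(64, 32)              # 64 = upsampled 32 + skip 32
        self.head = nn.Conv2d(32, n_classes, 1)     # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.down(e))
        u = torch.cat([self.up(b), e], dim=1)       # skip connection
        return self.head(self.dec(u))

logits = TinyUNet()(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 2, 128, 128])
```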
Table 6 provides a summary of studies using the U-Net architecture.
5.2.3. Other Segmentation Models
Efficient Dense modules of Asymmetric Convolution (EDANet) is another model that works well for real-time semantic segmentation. Therefore, EDANet can be useful for real-time applications like UAVs. Yang et al. [
85] proposed an EDANet that performs semantic segmentation for detecting rice lodging. Lodging occurs when the stem weakens and the plant falls over. EDANet outperformed many systems because of its efficiency, low computational cost, and small model size. The model identified normal rice with 95.28% accuracy and lodging with 86.17% accuracy, and the accuracy improved to 99.25% when less than 2.5% of rice lodging was neglected.
Weyler et al. [
86] proposed an ERFNet-based instance segmentation model that segments individual crop leaves in plant imagery to extract relevant phenotyping information and then groups the instances that belong to one crop together. The model made use of two decoders, one of which predicted the offset of image pixels from leaf regions, while the other predicted the offset of image pixels from plant regions. The two decoder outputs were then used to generate one image with leaf clusters and another with plant clusters. The model was trained on a dataset of 1,316 RGB images of sugar beet fields captured by a camera onboard a UAV and was evaluated on its ability to perform crop leaf segmentation as well as full crop segmentation. In crop leaf segmentation, it achieved an average precision of 48.7% and an average recall of 57.3%; for crop segmentation, it achieved an average precision of 60.4% and an average recall of 68%.
Guo et al. [
87] developed a three-stage model to perform plant disease identification for smart farming. The model first located diseased leaves using a Region Proposal Network (RPN) trained on a leaf dataset captured in complex environments, after which regression and classification networks were used to locate and retrieve the diseased leaves. The Chan-Vese algorithm was then used to perform segmentation based on a zero level set and energy minimization. Finally, the diseases were identified using a pre-trained transfer learning model. The proposed model significantly outperformed the traditional ResNet-101 model, with an accuracy of 83.75% compared to 42.5% for the latter.
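The Chan-Vese stage of this pipeline evolves a level set to minimize an energy that separates foreground from background without relying on strong edges. scikit-image ships a generic implementation of the algorithm; the snippet below shows only this isolated step on a hypothetical leaf crop with default-style parameters, not the full three-stage system of [87].

```python
# Isolated Chan-Vese segmentation step (scikit-image). Input crop and parameter
# values are illustrative, not those of the cited pipeline.
from skimage import color, io
from skimage.segmentation import chan_vese

leaf = color.rgb2gray(io.imread("leaf_crop.png"))  # hypothetical RPN-cropped leaf region
mask = chan_vese(leaf, mu=0.25, lambda1=1.0, lambda2=1.0, tol=1e-3,
                 init_level_set="checkerboard")

# The boolean mask marks the segmented (foreground) region.
print("foreground pixel fraction:", mask.mean())
```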
Sanchez et al. [
88] used a multilayer perceptron (MLP) neural network for the early detection of broad-leaved weeds and grass weeds in wide-row crops from UAV imagery. The data was manually collected using a UAV quadcopter equipped with a low-cost RGB camera, and image segmentation was performed using the multiresolution segmentation algorithm (MRSA). The model achieved an average overall accuracy of 80.9% on two classes of crops.
Zhang et al. [
89] proposed a unified CNN called UniStemNet for joint crop recognition and stem detection in real time. The architecture of UniStemNet is similar to that of Mask R-CNN: it consists of a backbone and two subnets, the first of which performs crop recognition while the other performs stem detection simultaneously. The backbone consists of five convolutional stages, where the first is a standard CNN with batch normalization while the other four each contain two MobileNetV2 inverted residual modules (IRMs). The subnets follow a varied-span feature fusion structure, as each has different detection targets. The evaluation was performed on the open-source CWF-788 dataset, with manually annotated labels. The model obtained an F1-score of 97.4% and an IoU of 94.5% in segmentation, which was slightly lower than that achieved by CR-DSS [90]. Nonetheless, the model achieved the best-known results in stem detection, with an SDR of 97.8%. A summary of other segmentation models is presented in
Table 7 below.
5.2.4. You Only Look Once (YOLO)
You Only Look Once (YOLO) is a real-time object detection neural network in which a single-stage network is applied to the full image. The network divides the image into regions and predicts bounding boxes along with probabilities for each region. YOLO has recently been gaining popularity in agricultural disease and crop detection. For example, Chen et al. [91] proposed using a UAV to photograph and detect pests and employed a Tiny-YOLOv3 model running on an NVIDIA Jetson TX2 to recognize their positions in real time. The detected pest positions could later be used to plan optimal pesticide spraying routes, which agricultural UAVs would then follow. The model attained its best mAP scores of 95.33% and 89.72% on 640 × 640 pixel test images.
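As a concrete illustration of single-stage detection, the snippet below loads a stock pre-trained YOLOv5 model through torch.hub and runs it on one image. The weights and image path are generic placeholders; the cited works trained their own pest and weed detectors.

```python
# Single-stage detection sketch with a generic pre-trained YOLOv5 model; the
# weights and image path are placeholders, unrelated to the cited pest models.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("field_image.jpg")       # hypothetical UAV image

# Each detection row: x1, y1, x2, y2, confidence, class index.
for *box, conf, cls in results.xyxy[0].tolist():
    print(model.names[int(cls)], round(conf, 2), [round(v, 1) for v in box])
```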
Similarly, Qin et al. [
92] proposed a solution for precision crop protection based on a light deep neural network (DNN) called Ag-YOLO, consisting of a modified ShuffleNet-v2 backbone, a ResBlock neck, and a YOLOv3 head. This model enabled a crop protection UAV to perform embedded real-time pest detection and autonomous spraying of pesticides. The model was tested on the Intel NCS2 hardware accelerator owing to its low weight and low power consumption, and the detection system achieved an average F1-score of 92.05%.
Parico et al. [
93] proposed YOLO-WEED, a weed detection system based on YOLOv3 and trained with 720 annotated UAV images of green onion crops using an NVIDIA GeForce GTX 1060. They obtained a mAP score of 93.81% and an F1-score of 94%.
Rui et al. [
94] proposed a novel comprehensive approach that combined transfer learning based on simulation data and adaptive fusion using YOLOv5 for improved detection of small objects. Their transfer learning and adaptive fusion mechanism led to a 7.1% improvement as compared to the original YOLOv5 model.
Parico et al. [
95] proposed a robust real-time pear fruit counter for mobile applications using only RGB data. Several variants of YOLOv4 (YOLOv4, YOLOv4-tiny, and YOLOv4-CSP) were compared. In terms of accuracy, YOLOv4-CSP was the best model, with an AP of 98%. In terms of speed and computational cost, YOLOv4-tiny showed promising performance at a rate comparable with YOLOv4 at lower network resolutions. Considering the balance of accuracy, speed, and computational cost, YOLOv4 was found to be the most suitable, with an AP above 96%, an inference speed of 37.3 FPS, and an FN rate of 6%. Thus, YOLOv4-512 was chosen as the detection model for the pear counting system with Deep SORT.
Jintasuttisak et al. [
96] explored the use of YOLO-V5 for detecting date palm trees in images captured by a UAV flying above farmlands in the Northern Emirates of the United Arab Emirates (UAE). The results of using YOLO-V5 for date palm tree detection in drone imagery were compared, both quantitatively and qualitatively, with those obtained with other popular CNN architectures: YOLOv3, YOLOv4, and SSD300. The results showed that, for the training data used, the YOLO-V5m (medium depth) model had the highest accuracy, with a mAP of 92.34%. Furthermore, it was able to detect and localize date palm trees of varied sizes in crowded, overlapping environments as well as in areas where the date palm tree distribution was sparse.
Tian et al. [
97] proposed an anthracnose lesion detection method based on deep learning. CycleGAN was used for data augmentation, and DenseNet was utilized to optimize the lower-resolution feature layers of the YOLO-V3 model. The improved model outperformed Faster R-CNN with VGG16 and the original YOLO-V3 model and could achieve real-time detection, obtaining an F1-score of 81.6% and an IoU of 91.7% on the entire dataset.
Table 8 presents a summary of methods using YOLO. As the table shows, most YOLO models achieve results above 90% across a variety of domains.
5.2.6. Region-Based Convolutional Neural Networks
The Region-Based Convolutional Neural Network (R-CNN) is a two-stage object detection approach that extracts many region proposals from input images, uses a CNN to perform forward propagation on each region proposal to extract its features, and then uses these features to predict the class and bounding box of each region proposal.
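A ready-made two-stage detector of this kind is available in torchvision. The sketch below runs a COCO-pretrained Faster R-CNN on a single image purely to illustrate the proposal-then-classify output format; it does not reproduce any of the agricultural models discussed below, and the image path is a placeholder.

```python
# Two-stage detection sketch with torchvision's COCO-pretrained Faster R-CNN;
# illustrative only, with a placeholder image path.
import torch
from torchvision import models
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

weights = models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

img = convert_image_dtype(read_image("field_image.jpg"), torch.float)  # hypothetical image
with torch.no_grad():
    output = model([img])[0]   # dict with 'boxes', 'labels', and 'scores'

keep = output["scores"] > 0.5  # keep confident detections only
print(output["boxes"][keep], output["labels"][keep])
```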
Sivakumar et al. [
99] proposed an approach in which object detection CNN models were trained and evaluated on low-altitude UAV images to detect weeds in mid and late seasons in soybean fields. Faster RCNN and SSD were both evaluated and compared in terms of weed detection performance. When Faster RCNN was configured with 200 box proposals, its weed detection performance was similar to that of the SSD model: the Faster RCNN model returned a precision of 0.65, a recall of 0.68, an F1-score of 0.66, and an IoU of 0.85, while the SSD model returned 0.66, 0.68, 0.67, and 0.84 for precision, recall, F1-score, and IoU, respectively. The performance of a patch-based CNN model was also evaluated and compared with the previous models, and Faster RCNN performed better than the patch-based CNN model. In conclusion, Faster RCNN was found to be the best model in terms of weed detection performance and inference time among the models compared in this study.
Ammar et al. [
101] proposed an original deep learning framework for the automated counting and geolocation of palm trees from aerial images. They applied several recent convolutional neural network models (Faster R-CNN, YOLOv3, YOLOv4, and EfficientDet) to detect palm trees and other trees and conducted a complete comparative evaluation in terms of average precision and inference speed. YOLOv4 and EfficientDet-D5 yielded the best trade-off between accuracy and speed (up to 99% mAP and 7.4 FPS).
Su et al. [
102] used the Mask-RCNN model to identify Fusarium head blight disease in wheat spikes and its degree of severity. To perform this task, two Mask-RCNNs performed instance segmentation on the input images, one segmenting individual spikes and the other segmenting diseased areas of the spikes. The severity of infection was then evaluated by calculating the ratio of infected spike pixels to the total number of spike pixels in the images. The backbone of this model, used for feature map extraction, combined a ResNet-101 model with an FPN. The model returned a prediction accuracy of 77.19% when compared against a set of manually labelled images.
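The severity measure used in [102] reduces to a ratio between two binary masks. Assuming the two Mask-RCNN outputs have been merged into boolean arrays of the same shape (the toy arrays below are made up for illustration), the computation is just:

```python
# Severity as the fraction of spike pixels that are also flagged as diseased,
# computed from two same-shaped binary masks (toy arrays, for illustration).
import numpy as np

spike_mask = np.zeros((512, 512), dtype=bool)
spike_mask[100:400, 200:260] = True            # toy spike region
disease_mask = np.zeros_like(spike_mask)
disease_mask[150:250, 200:260] = True          # toy infected region

infected = np.logical_and(spike_mask, disease_mask).sum()
severity = infected / max(spike_mask.sum(), 1)  # guard against an empty spike mask
print(f"severity: {severity:.1%}")              # -> severity: 33.3%
```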
Yang et al. [
103] used an FCN-AlexNet model to perform real-time crop classification using edge computing. The authors collected 224 images using a UAV during the growing period of rice and corn. The quantitative analysis showed that the SegNet model slightly outperformed FCN-AlexNet by 1% in the overall recall rate of object classification.
Menshchikov et al. [
104] proposed an approach for fast and accurate detection of hogweed. The approach uses a UAV with an embedded system on board running various fully convolutional neural networks (FCNNs), and the authors proposed an optimal FCNN architecture for the embedded system based on the trade-off between detection quality and frame rate. In their pilot study, they determined that different architectures could successfully solve the two-class semantic segmentation task for aerial hogweed detection. The SegNet model achieved the best ROC AUC, at 96.9%, and could detect hogweed that was not initially labeled. The modified U-Net architecture was characterized by a high frame rate (up to 0.7 FPS) and a reasonable recognition quality (ROC AUC > 0.938). Together with its low power consumption, the U-Net architecture demonstrated its applicability to real-time scenarios running on edge-computing devices; one of the U-Net modifications achieved 0.46 FPS on the NVIDIA Jetson Nano platform with an ROC AUC of 0.958.
Bah et al. [
77] proposed a model that combined a CNN and the Hough transform to detect crop rows in images taken by a UAV. The model, called CRowNet, combined SegNet (S-SegNet) with a CNN-based Hough transform (HoughCNet) and achieved an accuracy of 93.58% and an IoU of 70%.
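Conceptually, the Hough step fits straight lines to the vegetation pixels produced by the segmentation network. The OpenCV sketch below shows that generic step on a hypothetical binary crop mask; it is not the learned HoughCNet formulation used in CRowNet, and the thresholds are arbitrary.

```python
# Generic crop-row line fitting: probabilistic Hough transform on a binary
# vegetation mask. Illustrative only; thresholds are arbitrary and this is not
# the learned HoughCNet of the cited work.
import cv2
import numpy as np

mask = cv2.imread("crop_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical segmentation output
edges = cv2.Canny(mask, 50, 150)

lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=100, maxLineGap=20)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        print(f"row candidate at {angle:.1f} degrees")
```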
Hosseiny et al. [
10] proposed a fully unsupervised framework for plant detection in UAV-acquired images of agricultural fields, with its core based on a Faster R-CNN model with a ResNet-101 backbone for object detection. The framework’s primary idea was to automatically generate unlimited simulated training data from an input image. Two datasets were used, with 442 and 328 field patches, respectively, and the precision, recall, and F1-score were 0.868, 0.849, and 0.855, respectively.
Table 10 shows a summary of papers using two stage detectors.
5.2.7. Autoencoders
Weyner et al. [
105] addressed the problem of automated, instance-level plant monitoring in agricultural fields and breeding plots. They proposed a vision-based approach to perform joint instance segmentation of crop plants and leaves in breeding plots. They developed a CNN-based encoder-decoder network with lateral skip connections that follows a two-branch architecture with two task-specific decoders to determine the positions of specific plant keypoints and group pixels into individual leaf and plant instances. Finally, they performed pixel-wise instance segmentation of each crop and its associated leaves based on orthorectified RGB images captured by UAVs. Their method outperformed state-of-the-art instance segmentation approaches such as Mask-RCNN on this task, achieving the highest AP50 score of 0.94 at intermediate growth stages for the instance segmentation of sugar beet plants, compared to 0.71 by Mask R-CNN.
Lottes et al. [
106] presented a novel approach for joint stem detection and crop-weed segmentation using a Fully Convolutional Network (FCN) integrating sequential information. Their proposed architecture enables the sharing of feature computations in the encoder while using two distinct task-specific decoder networks for stem detection and pixel-wise semantic segmentation of the input images. All their experiments were conducted using different generations of the BoniRob platform. BoniRob was built by BOSCH DeepField Robotics as a multi-purpose field robot for research and development applications in precision agriculture such as weed control, plant phenotyping, and soil monitoring. The system achieved the best mAP scores of 85.4%, 66.9%, 42.9%, and 50.1% for Bonn, Stuttgart, Ancona, and Eschikon datasets, respectively for stem detection and 69.7%, 58.9%, 52.9% and 44.2% mAP scores for Bonn, Stuttgart, Ancona, and Eschikon datasets, respectively for segmentation.
Su et al. [
107] proposed a Deep Neural Network (DNN) that exploits the geometric location of ryegrass for the real-time segmentation of inter-row ryegrass weeds in a wheat field. Their proposed method introduced two subnets in a conventional encoder-decoder style DNN to improve segmentation accuracy. The two subnets treat inter-row and intra-row pixels differently and provide corrections to preliminary segmentation results of the conventional encoder-decoder DNN. A dataset captured in a wheat farm by an agricultural robot at different time instances was used to evaluate the segmentation performance, and the proposed method performed the best among various popular semantic segmentation algorithms (Bonnet, SegNet, PSPNet, DeepLabV3, U-Net). The proposed method ran at 48.95 FPS with a consumer-level graphics processing unit and thus is real-time deployable at camera frame rate. Their proposed model achieved the best mean accuracy and IoU scores of 96.22% and 64.21%, respectively.
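The architectural idea shared by [105,106,107] — a single encoder whose features feed two task-specific decoders — can be expressed schematically in a few lines. The toy network below uses arbitrary layer sizes and is not a reimplementation of any of these models.

```python
# Schematic shared-encoder / dual-decoder network: one head for pixel-wise
# semantic segmentation, one for a stem/keypoint heatmap. Layer sizes are
# arbitrary; this is not a reimplementation of the cited models.
import torch
import torch.nn as nn

class DualDecoderNet(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        def decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(32, out_ch, 2, stride=2),
            )
        self.seg_head = decoder(n_classes)   # crop/weed/soil segmentation logits
        self.stem_head = decoder(1)          # stem-location heatmap

    def forward(self, x):
        feats = self.encoder(x)              # features shared by both tasks
        return self.seg_head(feats), self.stem_head(feats)

seg, stems = DualDecoderNet()(torch.randn(1, 3, 256, 256))
print(seg.shape, stems.shape)  # [1, 3, 256, 256] and [1, 1, 256, 256]
```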
Table 11. Autoencoder Summary.

Paper | Autoencoder Model/Architecture | Strengths | Comments | Best Results
[105] | CNN-Autoencoder | Performed joint instance segmentation of crop plants and leaves using a two-step approach: detecting individual instances of plants and leaves, followed by pixel-wise segmentation of the identified instances | Low segmentation precision for smaller plants (outperformed by Mask R-CNN) | 0.94 AP50
[106] | FCN-Autoencoder | Performed joint stem detection and crop-weed segmentation using an autoencoder with two task-specific decoders, one for stem detection and the other for pixel-wise semantic segmentation | Did not achieve the best mean recall across all tested datasets; false detections of stems in soil regions | mAP scores of 85.4%, 66.9%, 42.9%, and 50.1% for the Bonn, Stuttgart, Ancona, and Eschikon datasets, respectively, for stem detection, and 69.7%, 58.9%, 52.9%, and 44.2% for segmentation
[107] | Autoencoder | Utilized two position-aware encoder-decoder subnets in the DNN architecture to segment inter-row and intra-row ryegrass with higher segmentation accuracy | Low pixel-wise semantic segmentation accuracy for early-stage wheat | Mean accuracy of 96.22% and IoU of 64.21%
5.2.8. Transformers
Vaswani et al. [
108] proposed the transformer architecture based on the attention mechanism. A transformer is a sequence transduction model initially designed to tackle natural language processing (NLP) problems. The use of transformers for computer vision tasks was initially limited by the high computational cost of training. To address this issue, Dosovitskiy et al. [109] proposed the Vision Transformer (ViT), which requires fewer resources while outperforming convolutional networks (CNNs). Other notable contributions include the Detection Transformer (DETR), which applies the transformer architecture to object detection [110].
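The core idea of ViT — splitting an image into fixed-size patches, linearly embedding them, adding a class token and positional embeddings, and feeding the resulting sequence to a standard transformer encoder — can be summarized in a minimal sketch. The dimensions below are illustrative and far smaller than ViT-B/16; this is a conceptual toy, not the model from [109].

```python
# Minimal ViT-style sketch: patch embedding, class token, positional embedding,
# transformer encoder, and a classification head. Dimensions are illustrative.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=128, depth=4, heads=4, n_classes=5):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)        # B x N x dim
        cls = self.cls.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                           # classify from CLS token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 5])
```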
Thai et al. [
111] used ViTs for the early detection of infected cassava leaves and the classification of their diseases. Initially, they used the ImageNet pre-trained ViT model published by the Google Research Team [
112]. The model was then fine-tuned using the Cassava Leaf Disease Dataset [
113]. Later, the model was quantized to reduce its size and accelerate the inference step (FPS) before being deployed on a Raspberry Pi 4 Model B. Their model achieved a 90.3% F1-score, compared with the best CNN score of 89.2%, achieved by the ResNet50 model. Further, they proposed a smart solution powered by the Internet of Things (IoT) that can be used in the agriculture industry for real-time detection of leaf diseases. The system consists of a drone that captures the leaf images along with the exact position of each spot in the field. The ViT model installed on the drone's Raspberry Pi classifies the images and clusters the infected leaves. The results are then combined with the spot positions and sent to a server via a 4G network to create a survey map of the field. Farmers and rescue agencies can view the map on their mobile phones and act before crops are lost.
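In PyTorch terms, the quantization step before edge deployment corresponds to something like the post-training dynamic quantization call below. The exact toolchain used in [111] is not specified here, and the randomly initialized MobileNetV3 stands in for the actual ViT classifier, so this is only an indicative example of how such a size reduction is obtained.

```python
# Indicative post-training dynamic quantization: convert the linear layers of a
# classifier to int8 to shrink the model for CPU/edge inference. The placeholder
# network below stands in for the actual model of the cited work.
import os
import torch
from torchvision import models

model = models.mobilenet_v3_small(weights=None).eval()  # randomly initialized placeholder
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m, path="tmp_model.pt"):
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(f"fp32: {size_mb(model):.1f} MB -> int8 linears: {size_mb(quantized):.1f} MB")
```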
Reedha et al. [
23] used two different ViT models for plant classification of UAV images. Images were collected using a drone mounted with a high-resolution camera deployed over a crop field of beet, parsley, and spinach located in France; the camera captured RGB orthorectified images at regular intervals in the field. The data was manually labelled into five classes: weeds, beet, parsley, spinach, and off-type green leaves. Data augmentation was employed to improve the robustness of the model and the generalization capabilities of the training dataset. The ViT-B32 and ViT-B16 models were then used, and the training data was also tested on EfficientNet and ResNet CNN architectures for comparison. The results showed that the ViT models outperformed the CNN models, with F1-scores of 99.4% and 99.2% obtained with ViT-B16 and ViT-B32, respectively, compared to 98.7% for EfficientNet B0, 98.9% for EfficientNet B1, and a close 99.2% for ResNet50. The authors pointed out that although all techniques obtained high accuracy and F1-scores, the classification of crop and weed images using ViTs yielded the best prediction performance. However, the lower computational efficiency of ViTs compared with CNNs is another consideration if the model is to be deployed for real-time processing on a UAV.
Karila et al. [
114] used ViT models to estimate grass sward (i.e., short grass) quality and quantity in a field. The datasets were captured in the spring “primary growth phase” and again in the summer “regrowth phase” using a quadcopter drone equipped with two cameras, the first capturing RGB images and the second capturing Fabry-Perot interferometer (FPI) images. The results showed that the ViT RGB models performed best on the different datasets, while VGG CNN models provided equally satisfactory results in most cases.
Dersch et al. [
115] used a detection transformer (DETR) to detect single trees in high-resolution RGB true orthophotos (TDOPs) and compared it to a YOLOv4 single-stage detector. The multispectral images were collected by a ten-channel camera system with a horizontal field of view and were post-processed using structure-from-motion (SFM) software. The data was manually labelled and split into 80% for training and 20% for validation. DETR outperformed YOLOv4 in mixed and deciduous plots, with F1-scores of 86% versus 65% and 71% versus 67%, respectively. Across all three test plots, both methods had problems with over-segmentation. Furthermore, DETR was considerably worse than YOLOv4 at detecting smaller trees in multiple cases; the authors attributed this to the fact that DETR uses lower-resolution feature maps than YOLOv4.
Chen et al. [
116] proposed a new efficient deep learning model called the Density Transformer (DENT) for automatic tree counting from aerial images. The model’s architecture combines a multi-receptive field CNN (Multi-RF CNN) that computes a feature map over the input images, a standard transformer encoder, and a Density Map Generator (DMG) that predicts the density distribution over the input images. They also introduced a benchmark dataset of aerial images for tree counting, called the Yosemite Tree dataset, and released it to the public [
116]. The model outperformed most state-of-the-art methods, with an MAE of 10.7 and an RMSE of 13.7, compared to 17.3 and 22.6, respectively, using YOLOv3. It is worth mentioning that the CANNet model [117] achieved the closest values, 10.8 and 13.8, respectively, and a better MAE than DENT in one of the four regions.
Finally, Zhang et al. [
118] developed a spectral-spatial attention-based transformer (SSVT) to estimate crop nitrogen status from UAV imagery. The model is an improved version of the standard Vision Transformer (ViT) that extracts the spatial information of images while also exploiting spectral information, which carries most of the relevant features in agricultural applications. The model further tackles the computational complexity that ViT suffers from on large images by adopting self-supervised learning (SSL), allowing it to train on unlabeled data. The results showed that the model, at 96.2% accuracy, outperformed the ViT model at 94.4% accuracy; however, it required 4 million more parameters than a ViT model.
Table 12 presents a summary of methods using transformers.