1. Introduction
In 2021, China’s pepper planting area accounted for 36.72% of the global planting area, and its production accounted for nearly half of the world total. At present, however, the degree of mechanized picking in China remains low, because current target detection algorithms cannot effectively identify the specific location of the peppers.
Deep learning algorithms have proven to be the most robust target detection methods for automatic fruit picking, and many researchers have applied different target detection methods to improve mAP and detection speed [1,2,3,4,5,6,7,8,9,10,11,12,13]. Balancing accuracy and speed, Addie et al. [14] used a variant of YOLOv4 together with Deep SORT to build a robust real-time pear fruit counter for a mobile application, which provides effective support for automatic pear picking and yield prediction. To address environmental factors such as stem and leaf shading, uneven illumination, and fruit overlap, Lawal [15] proposed the YOLOFruit algorithm, which uses a spatial pyramid and a feature pyramid network to extract detailed features, yielding a fruit detection network with an average detection accuracy of 86.2% and a detection time of 11.9 ms. Li [16] achieved 94.77% accuracy and a detection time of 25.86 ms by using HSV to segment the red region of the tomato inside each YOLOv4 detection box and outputting only those tomato targets whose segmented area exceeds a certain percentage. Similarly, for the problem of picking peppers in natural environments, Guo et al. [17] introduced deformable convolution and a coordinate attention module into YOLOv5, which improved the mAP by 4.6% over the original model and achieved a real-time detection speed of 89.3 frames/s on a mobile picking platform. However, given the diversity of pepper picking devices, the complex structure and large number of parameters of the above models make them difficult to deploy on mobile hardware for real-time detection.
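The post-filtering idea attributed to Li [16] above, keeping a detection box only when enough of its area is red in HSV space, can be illustrated with a minimal sketch. The hue, saturation, and value thresholds, the minimum area ratio, and the function names `red_ratio` and `keep_box` are assumptions made for illustration; they are not taken from the cited work.

```python
import colorsys


def red_ratio(pixels):
    """Fraction of RGB pixels (0-255 ints) whose HSV values fall in a red range."""
    red = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Red hue wraps around 0, so both ends of the hue circle are accepted.
        # All three thresholds are illustrative assumptions.
        if (h <= 0.04 or h >= 0.96) and s >= 0.5 and v >= 0.3:
            red += 1
    return red / len(pixels) if pixels else 0.0


def keep_box(image, box, min_ratio=0.3):
    """Keep a detection box only if enough of its area is red.

    image: list of rows, each row a list of (r, g, b) tuples.
    box: (x1, y1, x2, y2) in pixel coordinates, with x2 and y2 exclusive.
    min_ratio: assumed minimum fraction of red pixels inside the box.
    """
    x1, y1, x2, y2 = box
    crop = [px for row in image[y1:y2] for px in row[x1:x2]]
    return red_ratio(crop) >= min_ratio
```

In a real pipeline the boxes would come from the detector and the thresholds would be tuned on the target crop and lighting conditions.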
Many researchers have recognized that large models are difficult to deploy on mobile devices and have therefore begun to explore lightweight models. Yang et al. [18] added a 76×76 detection head with a CBAM attention mechanism to the YOLOv4-tiny network, which reduces the number of model parameters while effectively addressing occlusion and the low recognition accuracy for small tomatoes. Wang et al. [19] added CBAM to the FPN to learn the correlation between features of different channels by assigning a weight to each channel, strengthening the transmission of deep information through the network and reducing the interference of complex backgrounds on target recognition; the resulting detection network has fewer layers and a low memory footprint. However, these approaches only improve accuracy on top of an existing lightweight network and make no substantial change to it. In contrast, Sun et al. [20] obtained a small baseline model based on YOLOv5s by adding Ghost structures and adjusting the overall width of the feature map, and introduced transfer learning, which enabled fast and accurate identification of apple chilis while using fewer computational resources. Similarly, Rui et al. [21] proposed a classification model for pepper quality detection that combines transfer learning with a convolutional neural network, achieving fast convergence and improved performance in pepper detection. However, approaches of this kind do not make the model itself lightweight; they mainly reduce the resources needed for training. Zhou et al. [22], on the other hand, started from the equipment requirements: they removed the feature map used for detecting large targets in the YOLOX model, sampled the small-target feature map by nearest-neighbor interpolation, concatenated the shallow features with the final features, perturbed the gradient of the SiLU activation function, and optimized the loss function at the output. This reduced the number of model parameters by 44.8% and increased detection speed by 63.9% while maintaining excellent performance. Zhang et al. [23] implemented a GhostNet feature extraction network with a coordinate attention module in YOLOv4 and introduced depthwise separable convolution to reconstruct the neck and YOLO head, thus realizing a lightweight apple detection model. However, these methods change the model parameters only to a limited extent, and model performance degrades as the number of parameters decreases. To address these problems, Wang et al. [24] used transfer learning to build a YOLOv5s detection model, pruned it with a channel pruning algorithm, and fine-tuned the pruned model, achieving an apple detection accuracy of 95.8%, an average detection time of 8 ms per image, and a model size of only 1.4 MB, which effectively reduces the model size while preserving model performance.
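To make the channel pruning step mentioned above more concrete, the following minimal sketch selects which channels to keep. The text does not state which pruning criterion was used in [24]; the sketch assumes one common choice, ranking channels by the magnitude of their batch-normalization scale factors against a single global threshold. The function name `select_channels` and the layer names are hypothetical.

```python
def select_channels(bn_scales, prune_ratio):
    """Return, per layer, the indices of the channels kept after pruning.

    bn_scales: dict mapping a layer name to its non-empty list of
        batch-norm scale factors (assumed importance criterion).
    prune_ratio: approximate fraction of all channels to remove.
    """
    magnitudes = sorted(abs(g) for scales in bn_scales.values() for g in scales)
    n_prune = int(len(magnitudes) * prune_ratio)
    # Global threshold: channels whose |scale| is below it are removed.
    threshold = magnitudes[n_prune] if n_prune < len(magnitudes) else float("inf")
    kept = {}
    for name, scales in bn_scales.items():
        idx = [i for i, g in enumerate(scales) if abs(g) >= threshold]
        if not idx:
            # Always keep the strongest channel so the layer stays connected.
            idx = [max(range(len(scales)), key=lambda i: abs(scales[i]))]
        kept[name] = idx
    return kept


# Example with two hypothetical layers: half of the channels are pruned.
scales = {"conv1": [0.9, 0.01, 0.5, 0.02], "conv2": [0.03, 0.04]}
kept = select_channels(scales, prune_ratio=0.5)
```

After the kept indices are chosen, the corresponding convolution filters would be copied into a narrower network, which is then fine-tuned to recover accuracy, as the text describes.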
The above methods demonstrate the success of target detection in fruit picking. However, chili pepper picking involves densely growing fruits, uneven fruit sizes, severe occlusion by branches and leaves, and backgrounds similar in appearance to the fruits, so it is difficult to pick chili peppers efficiently with these methods [25,26,27,28,29,30,31,32]. At the same time, some current general-purpose models suffer from insufficient detection performance, strong environmental interference, large model structures, and slow inference. To develop a deep learning model that meets actual picking needs and realizes intelligent picking of chili peppers, this paper addresses the problems of current pepper picking models as follows. First, a three-channel attention mechanism network is used to help the neural network extract long-range pepper information, improving the model’s ability to recognize small-target peppers and solving the problem that the current CBAM cannot extract long-range information effectively. At the same time, the backbone network based on YOLOv5 is trained and the same detection mechanism is used, so that the model can be ported to different devices and perform real-time detection. Then, a multi-scale prediction algorithm is established to improve the prediction layer structure of YOLOv5 so that it can detect large, medium-sized, and small peppers, improving the detection of small-target peppers. Finally, a multi-scale adaptive feature fusion pyramid is established: an adaptive spatial feature pyramid structure is introduced and combined with the attention mechanism to suppress interfering background noise, while the features of different scales are adaptively fused in the final prediction results, improving model performance.
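The idea of adaptively fusing features of different scales can be sketched as a per-position weighted sum. This is a minimal illustration and not the network proposed in this paper: it assumes that the feature maps have already been resized to a common resolution and that the fusion weights are obtained by a softmax over per-scale logits, which a real network would learn. The function names `softmax` and `adaptive_fuse` are chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(features, logits):
    """Fuse same-sized feature maps with per-position softmax weights.

    features: list of S maps, each an H x W nested list of floats,
        assumed to be already resized to a common resolution.
    logits: list of S maps of the same shape holding unnormalized
        weights (learned in a real network, supplied directly here).
    """
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # The weights over the S scales sum to 1 at every position.
            weights = softmax([l[y][x] for l in logits])
            fused[y][x] = sum(w * f[y][x] for w, f in zip(weights, features))
    return fused
```

Because the weights are computed separately at each position, a scale that responds mostly to background at one location can be down-weighted there without being suppressed elsewhere.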
Author Contributions
Conceptualization, C.H. and W.Y.; methodology, J.P.; software, P.H.; validation, P.J., H.W. and Z.R.; formal analysis, C.H.; investigation, C.H.; resources, W.Y.; data curation, P.H.; writing (original draft preparation), H.W.; writing (review and editing), J.P.; visualization, C.H.; supervision, C.H.; project administration, P.J.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.