5.1. Algorithm Comparison Experiment
In this study, we conducted a systematic evaluation of various mainstream object detection algorithms, including SSD, Faster R-CNN, RT-Detr-resnet18, and several versions of YOLO. By analyzing key performance metrics such as Precision, Recall, mAP@0.5, and mAP@0.5-0.95, we revealed significant differences in detection accuracy and efficiency across the algorithms.
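For reference, the Precision and Recall reported throughout are computed from true positives (TP), false positives (FP), and false negatives (FN) at a given IoU threshold; a minimal sketch with hypothetical counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for one class at IoU threshold 0.5:
p, r = precision_recall(tp=85, fp=15, fn=21)  # p = 0.85
```

mAP@0.5 then averages, over all classes, the area under each class's precision-recall curve at IoU 0.5, while mAP@0.5-0.95 additionally averages over IoU thresholds from 0.5 to 0.95.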
Table 1 summarizes the comparison results. Notably, our proposed improved model (Ours) not only achieved outstanding accuracy while maintaining high computational efficiency but also demonstrated strong competitiveness in the mAP@0.5-0.95 metric, highlighting its immense potential for real-time detection applications. This result provides important insights into balancing detection accuracy and computational cost and offers strong support for selecting object detection algorithms for different practical scenarios.
Table 1 shows that Ours performs exceptionally well across multiple metrics, particularly in Precision and mAP@0.5. Ours achieves a Precision of 85.0%, higher than SSD’s 81.1%, demonstrating a clear improvement in detection accuracy. Ours reaches an mAP@0.5 of 83.0%, while Faster R-CNN reaches only 76.7%, indicating Ours’ stable performance in detecting objects at both large and small scales.

Compared to RT-Detr-ResNet18, Ours is also ahead in both accuracy and mAP. RT-Detr-ResNet18’s Precision is 72.9%, well below Ours’ 85.0%, and its mAP@0.5 of 74.7% falls far short of Ours’ 83.0%. In the mAP@0.5-0.95 evaluation, RT-Detr-ResNet18 scores 59.7%, while Ours achieves 66.9%, demonstrating stronger robustness. RT-Detr-ResNet18 also runs at a lower FPS (59.1 f/s versus Ours’ 111.1 f/s), and its floating point operations (78.1 GFLOPS) and parameter count (25.47M) are much higher than Ours’ 8.3 GFLOPS and 3.24M, giving Ours a clear edge in inference speed and computational efficiency.

In comparison with the YOLO series, Ours still performs excellently. Ours’ Precision of 85.0% exceeds YOLOv5s’ 81.4%, and its mAP@0.5-0.95 of 66.9% is similar to YOLOv8n’s 66.5%. Although YOLOv5n has a much higher FPS of 454.5 f/s compared to Ours’ 111.1 f/s, Ours strikes a better balance between accuracy and resource consumption, with only 3.24M parameters and 8.3 GFLOPS, showcasing its efficiency in resource-constrained environments. Compared to YOLOv10n, Ours’ Precision of 85.0% is significantly higher than YOLOv10n’s 73.8%; in mAP@0.5, Ours scores 83.0% against YOLOv10n’s 76.1%; and in mAP@0.5-0.95, Ours achieves 66.9%, while YOLOv10n reaches only 59.3%. Although YOLOv10n has a higher FPS of 123.4 f/s, its accuracy and robustness remain inferior to Ours. In the comparison with YOLOv11n, Ours leads again.
YOLOv11n’s Precision is 72.7% and its mAP@0.5 is 78.9%, both lower than Ours’ 85.0% and 83.0%. Although YOLOv11n’s FPS of 94.3 f/s is close to Ours’ 111.1 f/s, its detection stability and accuracy are not as good as Ours’. When compared to YOLOv7-tiny, Ours also performs better. YOLOv7-tiny’s Precision is 74.6%, lower than Ours’ 85.0%, and its mAP@0.5 of 79.2% is significantly below Ours’ 83.0%. In mAP@0.5-0.95, YOLOv7-tiny scores 61.8% against Ours’ 66.9%. Although YOLOv7-tiny’s FPS of 344.8 f/s exceeds Ours’ 111.1 f/s, its floating point operations (13.1 GFLOPS) and parameter count (6.04M) are much higher than Ours’. Ours thus achieves a better balance between accuracy and computational efficiency. In conclusion, Ours demonstrates superior performance across the algorithm comparisons, showcasing significant advantages in precision, robustness, and computational efficiency.
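FPS figures such as those in Table 1 are typically obtained by timing repeated inference runs; a minimal sketch with a stand-in for the inference call (the actual models are not reproduced here):

```python
import time

def measure_fps(infer, n_warmup=10, n_runs=100):
    """Average frames per second over n_runs timed calls, after warm-up."""
    for _ in range(n_warmup):      # warm-up excludes one-time setup costs
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Stand-in for single-image model inference:
fps = measure_fps(lambda: sum(range(1000)))
```

In practice the reported FPS also depends on batch size, input resolution, and hardware, so such numbers are only comparable when measured under identical conditions.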
Table 2 presents a comparison of precision across different algorithms for various object categories, with the Ours algorithm showing superior performance in multiple categories. First, in the Car category, Ours achieves a precision of 88.7%, surpassing YOLOv5n at 87.4%, YOLOv5s at 86.1%, and SSD at 83.2%, indicating that Ours is the most precise in detecting common vehicles. For the Truck category, Ours achieves a precision of 81.0%, slightly lower than YOLOv5n’s 81.3%, but still maintains strong accuracy. In the Van category, Ours achieves a precision of 79.0%, outperforming YOLOv7-tiny’s 70.7% and Faster R-CNN’s 57.1%, and is comparable to YOLOv5n’s 79.2%, demonstrating excellent stability. In the Long Vehicle category, Ours achieves a precision of 77.7%, significantly leading RT-Detr-ResNet18’s 60.3% and showcasing higher detection ability in complex scenarios. For the Bus category, Ours achieves a precision of 90.3%, closely approaching YOLOv5s’s 90.8% and surpassing YOLOv5n’s 88.6%.

In aircraft detection, Ours achieves 94.1% precision for the Airliner category, nearly matching YOLOv5s’s 95.7%, while in the Propeller Aircraft category it reaches a maximum precision of 98.7%, significantly outperforming the other algorithms. In the Trainer Aircraft category, Ours achieves 92.2% precision, outperforming most algorithms. For Chartered Aircraft, Ours achieves a precision of 85.7%, surpassing YOLOv11n’s 79.5% and matching YOLOv10n, showing stable performance. In the Fighter Aircraft category, Ours achieves a precision of 80.3%, a mid-level result that still surpasses YOLOv11n’s 60.2%. In the Others category, Ours achieves a precision of 78.5%, significantly outperforming YOLOv7-tiny’s 32.3% and RT-Detr-ResNet18’s 37.2%, demonstrating excellent capability in handling irregular objects.
For Stair Truck and Pushback Truck, Ours achieves precisions of 66.1% and 72.4%, respectively, maintaining solid performance in these specialized categories. In the Helicopter category, Ours achieves a precision of 92.7%, outperforming YOLOv11n’s 75.8%, and in the Boat category, Ours achieves a precision of 97.1%, surpassing all other algorithms, including YOLOv5n’s 96.4%. Overall, Ours demonstrates outstanding performance in the detection of various categories, particularly excelling in key categories such as Cars, Trucks, Long Vehicles, Aircraft, and Boats, surpassing most existing algorithms, proving its strong robustness and high efficiency in complex scenarios.
Table 3 presents a comparison of various algorithms based on Average Precision (AP) across different object categories. The Ours algorithm shows excellent overall performance across all categories, achieving particularly high precision in several key categories. First, in the Car category, Ours achieves the highest average precision of 94.4%, matching YOLOv7-tiny’s 94.4% and surpassing YOLOv5n’s 93.5%, SSD’s 91.2%, and RT-Detr-ResNet18’s 91.1%. In the Truck category, Ours achieves an average precision of 83.1%, significantly outperforming YOLOv5n’s 81.0% and RT-Detr-ResNet18’s 74.9%. This performance indicates that Ours maintains high precision and stability in common road vehicle detection tasks. In Van detection, Ours achieves an average precision of 84.5%, comparable to YOLOv7-tiny’s 84.6% and clearly outperforming YOLOv5n’s 82.1% and RT-Detr-ResNet18’s 76.5%. For Long Vehicle detection, Ours achieves an average precision of 86.1%, close to YOLOv5n’s 86.6% and far surpassing RT-Detr-ResNet18’s 74.3%. In the Bus category, Ours achieves a high average precision of 92.9%, close to YOLOv5s’s 94.3% and surpassing RT-Detr-ResNet18’s 84.1%.

In aircraft detection, Ours also shows outstanding performance. For Airliner detection, Ours achieves an average precision of 98.3%, higher than RT-Detr-ResNet18’s 95.2%. In Propeller Aircraft detection, Ours achieves 98.4%, close to YOLOv5n’s 99.1% and significantly surpassing RT-Detr-ResNet18’s 93.5%. For Trainer Aircraft, Ours achieves 97.0%, just behind YOLOv5s’s 98.2% and far above RT-Detr-ResNet18’s 92.3%. In Chartered Aircraft detection, Ours achieves 95.5%, comparable to YOLOv7-tiny and significantly higher than RT-Detr-ResNet18’s 89.2%. In Fighter Aircraft detection, Ours achieves an average precision of 98.2%, higher than Faster R-CNN’s 97.0%.
In more specialized categories, Ours achieves the highest average precision of 40.4% in the Others category, significantly outperforming other algorithms such as YOLOv5s’s 29.7% and RT-Detr-ResNet18’s 25.3%. For Stair Truck detection, Ours achieves an average precision of 43.3%, slightly below Faster R-CNN’s 45.1%. In Pushback Truck detection, Ours achieves 48.9%, significantly outperforming YOLOv5s’s 34.2%. In the Helicopter category, Ours achieves 85.3%, outperforming most algorithms, including RT-Detr-ResNet18’s 59.3%. In Boat detection, Ours achieves 98.5%, comparable to YOLOv5n, and surpasses Faster R-CNN’s 91.0%. Overall, Ours demonstrates exceptional average precision in Table 3, especially excelling in Car, Truck, Aircraft, and specialized vehicle detection tasks, outperforming most of the comparative algorithms. This proves its strong robustness and accuracy in multi-category object detection tasks.
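As a consistency check, mAP@0.5 is simply the mean of the per-class AP@0.5 values; averaging the fifteen values reported for Ours in Table 3 recovers the 83.0% mAP@0.5 from Table 1:

```python
# Per-class AP@0.5 (%) for Ours, as reported in Table 3
ap = {
    "Car": 94.4, "Truck": 83.1, "Van": 84.5, "Long Vehicle": 86.1,
    "Bus": 92.9, "Airliner": 98.3, "Propeller Aircraft": 98.4,
    "Trainer Aircraft": 97.0, "Chartered Aircraft": 95.5,
    "Fighter Aircraft": 98.2, "Others": 40.4, "Stair Truck": 43.3,
    "Pushback Truck": 48.9, "Helicopter": 85.3, "Boat": 98.5,
}
map50 = sum(ap.values()) / len(ap)  # mean over the 15 categories
print(round(map50, 1))              # 83.0, matching Table 1
```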
5.2. Ablation Study
To evaluate the impact of different modules on the performance of the YOLOv8 model, we designed and conducted an ablation study. We performed experimental comparisons of the YOLOv8 model and its progressively enhanced versions with various modules to analyze their effects on model performance. The experimental results cover metrics such as Precision, Recall, mAP@0.5, Frames Per Second (FPS), computational complexity (GFLOPS), and the number of parameters (Params). By gradually introducing the AFGCAttention mechanism, CARAFE upsampling operator, C2f-DCNV2-MPCA optimization module, and GIoU loss function, we were able to clearly observe the impact of these improvements on both model performance and computational overhead. The experimental results are summarized in Table 4.
Table 4 shows the following results. The baseline model YOLOv8n exhibits good performance, with a precision of 80.0%, recall of 79.4%, mAP@0.5 of 81.0%, a high FPS of 357.1 f/s, computational complexity of 8.1 GFLOPS, and 3.00M parameters. After adding the AFGCAttention mechanism, precision decreased to 76.1%, while recall slightly improved to 79.9% and mAP@0.5 increased to 82.2%; FPS dropped significantly to 123.4 f/s, indicating a substantial decrease in computational speed, while computational complexity remained unchanged and the parameter count slightly increased to 3.07M. After introducing the CARAFE upsampling operator, precision increased to 78.3%, recall slightly decreased to 78.9%, mAP@0.5 remained at 82.2%, FPS rose to 131.5 f/s, computational complexity increased to 8.4 GFLOPS, and parameters increased to 3.21M, indicating improved performance, especially in computational speed. Further adding the C2f-DCNV2-MPCA optimization module raised precision significantly to 84.0%, while recall dropped to 77.0%, mAP@0.5 rose slightly to 82.7%, FPS decreased slightly to 126.5 f/s, GFLOPS reduced slightly to 8.3, and parameters increased to 3.24M; the model thus maintained high precision at the cost of some computational speed. Finally, after adding the GIoU loss function, precision further increased to 85.0%, recall decreased to 75.3%, mAP@0.5 improved slightly to 83.0%, and FPS dropped to 111.1 f/s, with GFLOPS and parameters unchanged at 8.3 GFLOPS and 3.24M, respectively.

Overall, with the gradual addition of modules, the model showed significant improvements in detection accuracy and mAP@0.5, but a decline in computational efficiency (FPS), especially after introducing the AFGCAttention and GIoU modules, where the speed decrease was more pronounced. This indicates that, while improving performance, there is a trade-off between accuracy and computational efficiency.
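For reference, the GIoU used in the final ablation step extends IoU with a penalty based on the smallest enclosing box, which yields a useful signal even for non-overlapping boxes; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format (an illustration, not the authors’ implementation):

```python
def giou(a, b):
    """Generalized IoU for boxes (x1, y1, x2, y2); the loss is 1 - giou."""
    # Intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area  # penalize empty area of C

# Partially overlapping boxes; GIoU lies in (-1, 1]
g = giou((0, 0, 2, 2), (1, 1, 3, 3))
```

Unlike plain IoU, which is zero for all disjoint box pairs, GIoU decreases as boxes move apart, which is what makes it attractive as a regression loss.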
5.3. Result Visualization
To provide a more intuitive demonstration of the improved YOLOv8 model’s detection performance, this paper uses visualization tools such as heatmaps to analyze the detection results. By comparing the heatmaps of different improvement schemes, it is evident that the model performs more accurately in locating small and dense targets after the introduction of the AFGCAttention attention mechanism module. The application of the AFGCAttention mechanism helps the model better focus on the key areas in the image, thereby improving detection accuracy.
The heatmap examples showcase some typical detection results in complex scenarios. When handling scenes with multiple dense targets, the improved YOLOv8 model demonstrates stronger object recognition capabilities. Specifically, the heatmap clearly shows that the model can accurately locate targets such as trucks, airplanes, and cars, significantly reducing missed and false detections. These visualization results strongly validate the effectiveness of the proposed improvements in this paper and provide important reference for future research and practical applications.
As shown in Figure 9, the heatmap in (a) presents a more concentrated area, primarily focused on the center of the airplane. This indicates that the original model’s attention was more biased toward localized regions, capturing only certain prominent features in the input data while overlooking other valuable information. This localized attention results in limited performance when the model handles complex scenarios. In contrast, the heatmap in (b) displays a more uniform and extensive feature capture range. The introduction of the AFGCAttention mechanism allows the model to focus more comprehensively on different areas of the entire image, thus better balancing local and global information. This broader attention distribution not only improves the model’s ability to capture fine details but also enhances its understanding and processing of complex scenes, further improving the model’s accuracy and generalization capability.
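Heatmaps of this kind are typically rendered by min-max normalizing an activation or attention map and resizing it to the image resolution before overlaying it on the input; a minimal NumPy sketch with hypothetical activation values:

```python
import numpy as np

def to_heatmap(activation, out_hw):
    """Min-max normalize a 2-D activation map and nearest-neighbor resize it."""
    a = activation.astype(np.float64)
    a = (a - a.min()) / (a.max() - a.min() + 1e-12)   # scale to [0, 1]
    rows = np.linspace(0, a.shape[0] - 1, out_hw[0]).round().astype(int)
    cols = np.linspace(0, a.shape[1] - 1, out_hw[1]).round().astype(int)
    return a[np.ix_(rows, cols)]                      # resize to image size

# Hypothetical 2x2 activation map upsampled to a 4x4 "image"
heat = to_heatmap(np.array([[0.2, 1.5], [0.7, 3.0]]), (4, 4))
```

The resulting map can then be color-mapped and alpha-blended over the input image to produce figures like those shown.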
The following figure shows the performance of RT-Detr-ResNet18, YOLOv5s, YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, and the proposed improved algorithm (Ours) in the remote sensing image object detection task. Through these comparison charts, the performance differences between the various algorithms, particularly in handling complex backgrounds and small target scenarios, become more apparent.
Figure 10 shows a comparison of the performance of different algorithms in remote sensing image object detection, particularly in terms of the number and coverage of detection boxes. The algorithm labeled “Ours,” marked with yellow boxes, demonstrates outstanding performance in small target detection tasks. The RT-Detr-ResNet18 algorithm accurately detects the location of objects but still suffers from missed detection of small targets; it also has a high computational overhead and slow convergence speed. YOLOv5s and YOLOv5n maintain relatively high detection speeds while preserving a certain level of detection accuracy. As shown in the figure, these two algorithms perform reasonably well when detecting multiple targets in the image, correctly identifying objects such as cars, trucks, and vans. However, against some complex backgrounds, YOLOv5 may experience missed or false detections; in dense target scenarios, for example, its detection of small targets is not ideal. The YOLOv7-tiny algorithm performs relatively poorly, especially in handling complex scenes and detecting dense small targets. Although its fast detection speed makes it suitable for real-time applications, its accuracy and target recognition ability are noticeably lacking, and compared to other YOLO versions it performs relatively poorly overall. YOLOv8n, the lightweight version of the YOLOv8 series, is characterized by its efficiency and fast speed, making it suitable for resource-limited environments. Despite its compactness and lower computational resource requirements, it still shows shortcomings in detecting small targets and handling complex scenes with dense targets, so its overall performance is somewhat limited. The YOLOv10n algorithm, despite its lightweight optimization for resource-constrained scenarios, performs subpar in complex backgrounds and small target detection.
Compared to YOLOv8n, YOLOv10n shows further degradation in accuracy and stability, often resulting in missed or false detections, especially in the case of dense small targets. The YOLOv11n algorithm, while still a lightweight model, performs better than YOLOv10n. YOLOv11n shows more stable performance in detecting complex backgrounds and small targets, with improved accuracy, especially in reducing missed detections in dense scenes. The “Ours” algorithm performs the best in small target detection tasks among all the algorithms, particularly excelling in the number and coverage of detection boxes. It can accurately identify targets, especially demonstrating exceptional detection accuracy in dense small target scenarios, significantly reducing missed detections and showcasing strong robustness and stability.
Figure 11 shows the Precision-Recall (P-R) curves of the YOLOv8n algorithm and the improved YOLOv8n algorithm for the different categories. These curves visually present the performance comparison of the two algorithms across various object detection tasks.
Figure 11a demonstrates the performance of YOLOv8n, while Figure 11b illustrates the improvement in precision achieved by the improved YOLOv8n algorithm. These charts provide a clear perspective for comparing the detection accuracy before and after optimization, helping to further validate the effectiveness and advantages of the improved algorithm.
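The AP values read off such curves correspond to the area under the precision-recall curve, usually computed over the monotone precision envelope (all-point interpolation); a minimal sketch with hypothetical curve samples:

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the precision envelope."""
    # Pad so the curve spans recall 0 to 1
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing (the envelope)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas where recall increases
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# Hypothetical (recall, precision) samples for one class:
ap = average_precision([0.2, 0.5, 0.8], [1.0, 0.9, 0.6])  # 0.65
```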
Figure 11 shows that in specific categories, the improved algorithm’s AP for the Car category increased from 0.932 to 0.944, demonstrating a significant improvement in detection for this category. Similarly, in the Van category, the AP increased from 0.826 to 0.845, further enhancing the classification performance for this category. In the Long Vehicle category, the AP improved from 0.856 to 0.861. Although the increase is small, the model’s performance has become more stable.
In the detection of complex categories, the improved algorithm also shows significant performance improvements across multiple categories. For example, the AP for the Pushback Truck category increased from 0.438 to 0.489, indicating a significant enhancement in the model’s detection ability for this category. Similarly, in the Stair Truck category, the AP increased from 0.407 to 0.433, demonstrating better classification accuracy. The AP for the Helicopter category improved from 0.779 to 0.853, significantly improving the detection of complex objects.
Additionally, in the Airliner category, the improved algorithm maintained stable performance, with the AP increasing from 0.982 to 0.983. In the Boat category, the AP further increased from 0.981 to 0.985, demonstrating extremely high precision. For the Chartered Aircraft category, the AP improved from 0.934 to 0.955, showing a significant enhancement in object detection for this category. In the Trainer Aircraft category, the AP increased from 0.964 to 0.970, indicating an improvement in detection accuracy for this category as well.
Overall, the improved algorithm’s mAP@0.5 increased from 0.810 to 0.830, demonstrating an overall improvement in detection performance. These charts clearly compare the performance differences between the original algorithm and the improved algorithm across different object detection tasks, proving the effectiveness of the improved algorithm, especially in the significant enhancement of both common and complex categories.
Figure 12 shows the Precision-Confidence curves of the YOLOv8n algorithm and the improved YOLOv8n algorithm for performance comparison and analysis.
Figure 12 shows that the blue thick line represents the average performance across all categories. YOLOv8n achieves a precision of 1.00 at a confidence level of 0.99, while the improved algorithm reaches the same precision at a confidence level of 0.96. From the shape of the curves, most categories (such as Car, Truck, Bus, Airliner, etc.) show a gradual increase in precision as the confidence level rises, reaching close to 1.0 in the high-confidence range. However, certain categories (such as Pushback Truck) exhibit noticeably lower detection precision, especially in YOLOv8n, where the precision is lower and fluctuates significantly at medium confidence levels.
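A Precision-Confidence curve like this can be traced by sweeping a confidence threshold over the scored detections and computing precision among those retained; a minimal sketch with hypothetical (confidence, is-true-positive) pairs:

```python
def precision_at_confidence(detections, threshold):
    """Precision over detections whose confidence >= threshold.

    detections: list of (confidence, is_true_positive) pairs.
    """
    kept = [tp for conf, tp in detections if conf >= threshold]
    return sum(kept) / len(kept) if kept else 1.0  # vacuous precision at top

# Hypothetical detections for one class:
dets = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.40, False)]
curve = [(t / 10, precision_at_confidence(dets, t / 10)) for t in range(11)]
```

Curves that reach precision 1.0 at a lower threshold, as the improved model does at 0.96 versus 0.99, keep more detections before precision saturates.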
In Figure 12b, the improved algorithm shows improved performance in some categories (such as Pushback Truck), with the curve becoming steeper. This indicates that the improved algorithm has reduced instability in detection, especially in the medium-confidence range.
Overall, the improved YOLOv8n algorithm demonstrates more stable performance across most categories and achieves precision comparable to the baseline algorithm at lower confidence levels, highlighting the effectiveness of the optimization. It is worth noting, however, that the precision for the Long Vehicle category has not increased significantly, indicating that this category still presents detection challenges. In summary, the improved algorithm has made clear progress in reducing detection fluctuations and enhancing detection precision for certain categories, but a few difficult-to-detect categories still require further optimization.