4.2. Dataset
The dataset used in this experiment is the aerial-drone floating objects (AFO) dataset proposed in [30] specifically for the detection of floating objects. It covers six object categories: human, surfboard, boat, buoy, sailboat, and kayak. The dataset contains 3647 images, most of which are large UAV images (e.g., 3840 × 2160), with more than 60,000 annotated objects. The original images are too large to be used directly for network training, so we cropped them into multiple 416 × 416 images. Following the cropping method proposed in [31], we leave a 30% overlap whenever an object lies at the edge of a crop, which preserves the integrity of object information, as shown in Figure 10. After cropping, we obtained a total of 33,391 images of size 416 × 416, which we divided into training, validation, and test sets in the ratio 8:1:1 for the training and testing of the model.
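For concreteness, the following is a minimal sketch of this overlap-aware cropping strategy; the function names, the edge-crossing test, and the exact overlap policy are illustrative assumptions rather than the reference implementation from [31].

```python
# Sketch of overlap-aware tiling: stride equals the tile size unless an
# annotated box is cut by the tile border, in which case the next tile
# steps back by 30% to re-cover the object. Border tiles would be
# clamped or padded in a full implementation.
from typing import List, Tuple

TILE = 416       # crop size used in our experiments
OVERLAP = 0.30   # fractional overlap applied at object-cutting edges

def box_cut_by_edge(box: Tuple[int, int, int, int], x0: int, y0: int) -> bool:
    """True if a box (x1, y1, x2, y2) crosses the right or bottom border
    of the tile whose top-left corner is (x0, y0)."""
    x1, y1, x2, y2 = box
    right, bottom = x0 + TILE, y0 + TILE
    return (x1 < right < x2) or (y1 < bottom < y2)

def tile_origins(width: int, height: int,
                 boxes: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int]]:
    """Enumerate top-left corners of 416x416 crops over one image."""
    origins = []
    y = 0
    while y < height:
        x = 0
        while x < width:
            origins.append((x, y))
            step = TILE
            if any(box_cut_by_edge(b, x, y) for b in boxes):
                step = int(TILE * (1 - OVERLAP))  # 30% overlap with next crop
            x += step
        y += TILE  # vertical overlap handled analogously; omitted for brevity
    return origins
```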
4.3. Comparison Experiment
To validate the effectiveness of our proposed method, we compared SG-Det with a range of commonly used lightweight object detectors, including SqueezeNet, MobileNetv2 [32], MobileNetv3 [33], ShuffleNetv2 [34], GhostNet, YOLOv3-tiny [35], YOLOv4-tiny [36], and EfficientDet. The evaluation metrics used in this experiment are mean Average Precision (mAP), Frames Per Second (FPS), Giga Floating-point Operations (GFLOPs), and the number of parameters (Params), which assess the accuracy, inference speed, computational complexity, and model size, respectively. It is worth noting that SqueezeNet, MobileNetv2, MobileNetv3, ShuffleNetv2, and GhostNet are lightweight classification networks rather than end-to-end object detectors. In our experiment, we removed their fully connected layers, following literature recommendations, and used them to replace the backbone network of Faster R-CNN to perform object detection. We conducted the same experiment on our proposed Shuffle-GhostNet to verify its effectiveness. The experimental results are shown in Table 1.
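As an illustration of this backbone-swap pattern, the sketch below builds a Faster R-CNN detector on top of a truncated MobileNetv2 using torchvision; this is an assumed, simplified reconstruction of the procedure, not our exact training code.

```python
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Classification network with the classifier head removed: keep only the
# convolutional feature extractor and record its output channel count.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# 6 AFO categories + 1 background class.
model = FasterRCNN(backbone,
                   num_classes=7,
                   rpn_anchor_generator=anchor_generator)
```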
Owing to the reduced number of output channels and the use of group convolution in its design, Shuffle-GhostNet has significantly lower computational complexity and fewer parameters than the other lightweight networks. The number of output channels in Shuffle-GhostNet is only 1/6 of that in GhostNet, which we expected might decrease model accuracy. However, our experimental results showed a 1% increase in mAP. This suggests that channel shuffling successfully enhances the information exchange between channel groups, and that the chosen number of channels is sufficient to complete the detection task effectively. The high FPS achieved by Shuffle-GhostNet demonstrates that our proposed method meets the timeliness requirements of maritime SAR. Although there is a slight accuracy gap compared to MobileNetv2, Shuffle-GhostNet strikes a balance among the performance metrics that meets the practical needs of deployment.
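The channel shuffle operation referred to here is the standard ShuffleNet-style permutation; a minimal PyTorch version is sketched below, with the tensor sizes chosen purely for illustration.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so that the next group
    convolution sees features from every group."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and channel axes
    return x.view(n, c, h, w)                 # flatten back

# With 16 channels and 4 groups the channel order becomes
# [0, 4, 8, 12, 1, 5, 9, 13, ...], mixing information between groups.
y = channel_shuffle(torch.randn(1, 16, 8, 8), groups=4)
```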
To validate the effectiveness of our proposed object detector, we compared it with other end-to-end lightweight object detectors, namely YOLOv3-tiny, YOLOv4-tiny, and EfficientDet. Additionally, to observe the contribution of each module, we included the Faster R-CNN-based Shuffle-GhostNet detector for comparison. The experimental results are presented in Table 2. Compared to the original BiFPN in EfficientDet, our proposed BiFPN-tiny combined with ASPP is more focused and exploits the potential of multi-scale feature fusion more fully, resulting in a significant improvement in both accuracy and speed. Compared to the Faster R-CNN-based Shuffle-GhostNet detector, our approach not only achieves a slight improvement in accuracy and speed, but also significantly reduces the number of model parameters and the computational effort required. This highlights the robustness and versatility of our overall framework, beyond the effectiveness of Shuffle-GhostNet alone. Compared to the other lightweight object detectors, our approach has a slightly lower FPS than YOLOv4-tiny; however, it still provides real-time detection capability, making it a suitable option for various applications.
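For reference, BiFPN nodes in EfficientDet merge multi-scale features with fast normalized weighted fusion; the sketch below shows this fusion primitive under the assumption that BiFPN-tiny reuses it, while the exact BiFPN-tiny topology follows the paper rather than this snippet.

```python
import torch
import torch.nn as nn

class FastNormalizedFusion(nn.Module):
    """EfficientDet-style weighted fusion: each input map gets a learned
    non-negative weight, normalized to sum to ~1. Inputs are assumed to
    be resized to a common resolution beforehand."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, *inputs: torch.Tensor) -> torch.Tensor:
        w = torch.relu(self.weights)   # keep weights non-negative
        w = w / (w.sum() + self.eps)   # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))

# e.g., fuse a top-down feature with a same-level lateral feature
fuse = FastNormalizedFusion(n_inputs=2)
p4_td = fuse(torch.randn(1, 64, 52, 52), torch.randn(1, 64, 52, 52))
```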
To assess the detection performance of our proposed method on targets of varying sizes in UAV images, we list the $AP_S$, $AP_M$, and $AP_L$ scores for each model, which respectively represent the average precision for detecting small, medium, and large targets.
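One common way to obtain these size-bucketed scores is the COCO evaluation toolkit; the sketch below is a generic example in which the file paths are placeholders, and the use of pycocotools here is purely illustrative of how such scores are computed.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/test.json")       # placeholder annotation file
coco_dt = coco_gt.loadRes("detections.json")  # placeholder detection results

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()

# stats[3], stats[4], stats[5] hold AP for small, medium, large objects
ap_small, ap_medium, ap_large = ev.stats[3], ev.stats[4], ev.stats[5]
```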
Table 3 shows that the models exhibit distinct detection capabilities for objects of different scales. The Faster R-CNN-based lightweight detectors employ deeper and wider network layers and therefore detect medium- and large-scale objects well; however, as the network deepens, the features of small objects become progressively weaker, making it difficult to guarantee small-object detection accuracy. Our proposed method, leveraging the strengths of BiFPN-tiny and ASPP, preserves small-scale features to a great extent, as supported by experimental results demonstrating its effectiveness in detecting small objects in UAV images.
We then conducted a thorough analysis of the experimental results of our proposed method, as presented in Figure 13, which displays the number of instances and the detection accuracy for each target category. Notably, the three categories with the lowest detection accuracy are also the least represented in the dataset. Furthermore, because of their high frequency, humans are prone to occluding and overlapping other targets in real-world images, which can lead to missed and false detections. Nonetheless, our proposed method achieves a detection accuracy of up to 91% for humans, the primary object of maritime SAR, which satisfies the requirements of practical applications. In summary, our proposed method achieves a better trade-off among the performance metrics, which is advantageous for real-world maritime SAR applications.
4.4. Ablation Experiment
To verify the effectiveness of our proposed method and the contribution of each module, we conducted an ablation experiment. In this section, we added each module to the model step by step while keeping the experimental environment and configuration unchanged. The results of the ablation experiment are presented in Table 4.
Our experimental results indicate that BiFPN-tiny alone has insufficient feature extraction capability, resulting in lower model accuracy. To address this limitation, we experimented with incorporating additional modules, such as ASPP and RFB [37], to enhance the feature extraction ability of the network. Among these, we found that the ASPP module, which combines the advantages of atrous convolutions with different sampling rates, was more effective at multi-scale feature fusion and extraction. We also attempted to improve the network's feature extraction capability by incorporating attention mechanisms, such as CBAM [38], to highlight the most important parts of the features. However, our experimental results showed that the feature extraction capability of the network was already close to saturation, and adding the CBAM module only complicated the network structure without improving its performance.
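To make the ASPP design concrete, a minimal PyTorch sketch is given below; the channel widths and dilation rates are illustrative assumptions rather than the exact configuration used in our network.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP block: parallel atrous (dilated) convolutions with
    different sampling rates, concatenated and fused by a 1x1 conv."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch keeps the spatial size but sees a different
        # receptive field, enabling multi-scale feature extraction.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```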
We then added group convolution with 2 and 4 groups, respectively, as shown in the sketch below. The experimental results showed that with 2 groups, a single feature group carries too much information, so the advantage of group convolution is not obvious; with 4 groups, the network can effectively exploit the features from different channels, resulting in better performance. We therefore concluded that the 4-group design is more reasonable. Lastly, we incorporated channel shuffling into the network architecture to improve the exchange of information between different groups of channels, resulting in a notable enhancement of the model's overall performance. With this, we have validated the effectiveness of all proposed methods and modules, ensuring a balance between accuracy and speed.
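The parameter savings from grouping can be verified directly; in the short sketch below (with an arbitrary channel width), 4 groups cut the parameters of a 3 × 3 convolution to a quarter, which is also why channel shuffle is needed to restore cross-group information flow.

```python
import torch.nn as nn

# A grouped 3x3 convolution splits the channels into independent groups,
# dividing both parameters and FLOPs by the group count. The channel
# width (128) is an arbitrary illustrative value.
def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

dense    = nn.Conv2d(128, 128, 3, padding=1, bias=False)
grouped2 = nn.Conv2d(128, 128, 3, padding=1, groups=2, bias=False)
grouped4 = nn.Conv2d(128, 128, 3, padding=1, groups=4, bias=False)

print(n_params(dense), n_params(grouped2), n_params(grouped4))
# 147456 73728 36864 -> groups=4 uses a quarter of the parameters
```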