4.3. Comparison with Other CNN-Based Methods
Across the whole training set, the ratio of labeled images to unlabeled images is 1:10. To validate the performance of the proposed approach, we adopted several well-known and popular detectors for comparison, covering both supervised and semi-supervised detection methods. The supervised detectors include Faster R-CNN [15], YOLOv8 [50] and Swin-Transformer [21], which were trained with only the labeled data. The semi-supervised detectors include Soft Teacher [33] and TSET [35], which were trained with both labeled and unlabeled data.
The quantitative analysis results for each algorithm on SSDD are presented in Table 3, and several visual detection results on SSDD are presented in Figure 3.
Faster R-CNN stands out as one of the most classical two-stage algorithms, recognized for its high accuracy in supervised detection tasks. As demonstrated in Table 3, it showed commendable performance across various COCO metrics, with a notable AP50 score of 0.839. As shown in the first row of Figure 3, while it successfully identified several targets in the nearshore area, its performance suffered from missed detections for small targets located farther from shore. Moreover, it tended to detect partially adjacent targets as a single ship target, including instances where ships were adjacent to other ships or to clutter.
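For reference, the APt metrics score a predicted box as a true positive when its intersection-over-union (IoU) with a ground-truth box exceeds the threshold t (0.5 for AP50, 0.75 for AP75). A minimal IoU sketch in Python, purely illustrative and independent of any of the compared implementations:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two half-overlapping 10x10 boxes: IoU = 50 / 150 ~= 0.33, below the
# 0.5 threshold, so this match would not count toward AP50.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```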
The quantitative results show that the single-stage supervised detector YOLOv8 achieved the lowest performance on most indices among these methods. With an AP50 score of only 0.766 and an APl score of merely 0.013, YOLOv8 performed worse than the other detectors. Furthermore, the detection visualization results in Figure 3 reveal that YOLOv8's detection performance is not as good as that of Faster R-CNN. YOLOv8 exhibited significant performance degradation in near-shore scenarios, missing a number of vessel targets. This deficiency may be attributed to its lightweight architecture and rapid detection process.
Among the three supervised detectors, Swin-Transformer performed commendably, with an AP50 value of 0.878. Swin-Transformer is able to capture image detail and model global contextual information. Despite these advantages, it still suffered from a high missed-detection rate for small far-shore targets and an increased false alarm rate in nearshore scenarios.
Soft Teacher and TSET are semi-supervised detectors, both of which use Faster R-CNN as the baseline. The former leverages a special loss function to learn from negative samples, addressing the issue of low recall rates, while the latter optimizes pseudo-labels by employing multiple teacher models.
Soft Teacher achieved an AP50 of 0.868. In particular, because it prioritizes negative samples, it yielded favorable results in the detection of small far-shore targets. Nevertheless, due to its lack of emphasis on pseudo-label refinement, it was prone to missed detections when filtering targets by confidence, and its detection performance degraded in complex near-shore scenarios (e.g., multiple ships docked at the same port, as illustrated in the second scene of Figure 3(b)). Although the SSDD dataset primarily consists of small to medium-sized targets, Soft Teacher obtained the best performance on large targets, with an APl of 0.392.
Despite being a semi-supervised detector, TSET's performance remained subpar. As evident from the COCO metrics presented in Table 3, its AP50 score is a mere 0.769, falling behind the Swin-Transformer on several metrics. Moreover, as depicted in Figure 3(d), TSET struggled with multiple far-shore targets, often ignoring or completely missing small targets. While there was an improvement in the accuracy of near-shore targets, TSET still exhibited more missed targets than the Swin-Transformer.
In contrast, our method outperformed all others in five COCO metrics, namely AP, AP50, AP75, APs, and APm, with respective values of 0.498, 0.905, 0.461, 0.504, and 0.483. Typically, attention is placed on the performance of AP50; in this regard, our method demonstrated a notable improvement of approximately 4% over Soft Teacher. From Figure 3, it can also be seen that our approach excelled at achieving a high recall rate for far-shore targets. In multi-target far-shore scenes, our model succeeded in detecting the majority of ship targets, significantly enhancing the recall rate. Although all the other methods failed to distinguish adjacent docked ships accurately, our model effectively discerned ship targets in complex near-shore backgrounds. Specifically, in Figure 3(b), our model successfully distinguished targets docked at the same port. While our model may produce a small number of false positives, its overall advantage in reduced missed detections is substantial. In summary, our method outperformed the other five detectors on the performance metrics.
The PR curves for each algorithm are depicted in Figure 4, with the AP50 values of each algorithm displayed alongside the plot. It is evident that our method achieves the maximum area under the curve (AUC), which is 0.90. This verifies that our method exhibits the best performance among all six algorithms.
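For clarity, the AUC reported here is the area under the precision-recall curve. A simplified sketch of how such a curve and its area can be computed from ranked detections (illustrative only; the official COCO evaluation interpolates precision at fixed recall points rather than integrating the raw curve):

```python
import numpy as np

def pr_curve_auc(scores, is_true_positive, num_gt):
    """PR curve and its area from detection confidences.

    scores: confidence of each detection; is_true_positive: 1 if the
    detection matches a ground-truth ship (IoU >= 0.5), else 0;
    num_gt: total number of ground-truth ships.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_true_positive, dtype=float)[order]
    tp = np.cumsum(hits)          # cumulative true positives
    fp = np.cumsum(1.0 - hits)    # cumulative false positives
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Trapezoidal area under the PR curve over the recall axis.
    return recall, precision, np.trapz(precision, recall)

# Toy example: three detections ranked by confidence, two correct.
recall, precision, auc = pr_curve_auc([0.9, 0.8, 0.3], [1, 1, 0], num_gt=2)
print(auc)
```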
Additionally, we conducted experiments on the more complex AIR-SARShip-1.0 dataset. Table 4 gives the quantitative analysis results on this dataset for the six algorithms, and the detection results on three representative scenes are illustrated in Figure 5. As in Figure 3, the green boxes denote all targets detected by each algorithm, the red ellipses mark false alarms, and the orange ellipses mark instances that the algorithms failed to detect.
On this dataset, the supervised methods exhibited a noticeable decrease in performance compared to the semi-supervised methods. This is mainly attributed to the complex environment and the low image quality.
In terms of AP, all supervised methods fell below 0.3, while the semi-supervised methods reached at least 0.34; MTDSEFN obtained the highest AP value at 0.351. Regarding the crucial AP50 metric, our method exhibited the best performance at 0.793. Notably, the semi-supervised methods demonstrated a remarkable improvement of 0.1 over the supervised methods, and the proposed method achieved a nearly 2% improvement over the second-best method on this dataset. Due to the low resolution of images in the AIR-SARShip-1.0 dataset, which mainly comprises medium to large targets with very few small targets, all algorithms exhibited low APs values. In a nutshell, the proposed method achieved optimal performance on the AP, AP50, APs, APm and APl metrics, with 0.351, 0.793, 0.097, 0.363 and 0.524, respectively.
As can be observed from Figure 5, Faster R-CNN produced considerable false alarms on near-shore targets, and significant missed detections occurred even under far-shore conditions. YOLOv8 produced more false alarms than Faster R-CNN, consistent with its poorer COCO metrics. Swin-Transformer, in contrast, demonstrated outstanding detection performance, particularly for far-shore targets, as can be observed from the results of the scene in Figure 5(b).
The semi-supervised models exhibited superior performance in detecting far-shore targets. As can be seen from Figure 5, most far-shore targets were successfully detected by the three semi-supervised models. However, detecting near-shore ships remained highly challenging. Not only did Soft Teacher and TSET struggle to detect small near-shore targets, they also failed to distinguish adjacent ship targets correctly in the second scene of Figure 5(b). Additionally, in scene (c) of Figure 5, both of them failed to detect the two near-shore small targets in the upper right corner. In contrast, our method clearly distinguished the adjacent ships in the second scene and successfully detected the two near-shore small targets in the third scene. Moreover, the proposed method did not exhibit a significant increase in false detections of docked ships. Briefly, the effectiveness of our method is demonstrated on both datasets.
Figure 6 displays the PR curves of the six algorithms on the AIR-SARShip-1.0 dataset. In this plot, the superiority of the semi-supervised algorithms is more pronounced. Compared to all the others, our approach performed better overall, maintaining higher precision under both low and high recall conditions. For our approach, the area under the curve (AUC) reaches 0.79, indicating its effectiveness and superiority.
4.4. Ablation Study
We conducted ablation experiments on the AIR-SARShip-1.0 dataset with a labeled-to-unlabeled data ratio of 1:10 to analyze the effect of the different modules of our method. The experimental parameters remained consistent with the comparative experiments, and the results are summarized in Table 5. The experimental results demonstrate that the joint utilization of the TG and AT modules leads to a remarkable increase in detection performance. Within the TG, two parts are employed: the multi-teacher structure and D-S evidence fusion. It is worth noting that D-S evidence fusion requires multiple sources of evidence as input; thus, it is not applicable when the multi-teacher component is absent.
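To make the fusion step concrete, the sketch below applies Dempster's rule of combination to the beliefs of two teachers about one candidate box. The three-element frame {ship, background, uncertain} and the mass values are illustrative assumptions, not the paper's exact formulation:

```python
def ds_combine(m1, m2):
    """Dempster's rule of combination for two mass functions over the
    frame {ship, background}, with 'u' the full (uncertain) set.

    's' and 'b' conflict with each other; 'u' intersects everything.
    """
    # Conflict mass K: evidence assigned to incompatible hypotheses.
    k = m1['s'] * m2['b'] + m1['b'] * m2['s']
    norm = 1.0 - k
    return {
        's': (m1['s'] * m2['s'] + m1['s'] * m2['u'] + m1['u'] * m2['s']) / norm,
        'b': (m1['b'] * m2['b'] + m1['b'] * m2['u'] + m1['u'] * m2['b']) / norm,
        'u': (m1['u'] * m2['u']) / norm,
    }

# Two teachers, one confident and one uncertain about a candidate box:
# the fused belief in 'ship' (~0.87) exceeds either input alone.
t1 = {'s': 0.8, 'b': 0.1, 'u': 0.1}
t2 = {'s': 0.5, 'b': 0.2, 'u': 0.3}
print(ds_combine(t1, t2))
```

Agreeing sources thus reinforce each other, while conflicting evidence is discounted through the normalization by 1 - K; this is what lets the TG suppress pseudo-labels that only a single teacher supports.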
From Table 5, it is evident that our method achieves optimal performance on four of the six COCO indicators. Specifically, the AP, AP50, APm, and APl metrics reach the highest levels, with values of 0.351, 0.793, 0.363, and 0.524, respectively. Notably, AP50 exhibits a nearly 2% increase. However, on the AP75 and APs indicators, the TG alone exhibited superior performance. AP75 and APs, which concern matches with an IoU greater than 0.75 and targets of smaller size, respectively, require more precise bounding box predictions. The potential inconsistency between the pseudo-bboxes generated by the AT and those generated by the TG may introduce bias into the bboxes learned by the student model. Consequently, when the AT is excluded, our method attains more accurate bbox predictions, reflected in higher AP75 and APs performance. The experimental setup in the fourth row does not employ D-S evidence fusion, despite utilizing both the TG and the AT. As a result, the reliability of the pseudo-labels cannot be guaranteed, leading to suboptimal AP50 performance. This underscores the crucial role of the D-S fusion mechanism proposed in this paper, which significantly enhances the quality of the pseudo-labels and the overall model performance.
In a nutshell, the experimental results indicate that the combination of the two proposed branches in our method effectively boosts the performance of the semi-supervised detector.
4.5. Hyperparameters Experiments
This section explores the impact of each hyperparameter of the model on its detection performance.
Firstly, we investigated the influence of the number of teachers in the TG. The experimental results, depicted in Figure 7, reveal that increasing the number of teachers enhances the model performance. However, as the number of teachers grows, the computational load during model training increases remarkably. Notably, when the number of teachers increases from 4 to 5, the accuracy improvement is tiny. To mitigate the computational burden, we therefore select 4 teachers for the TG in our framework.
Next, we analyzed the impact of the two weighting parameters in the loss function on model performance. Table 6 illustrates the effect of the first parameter, with AP50 serving as the performance metric; here, the other parameter is set to 1. The optimal performance is observed when the parameter value is 0.05. An inadequately small value such as 0.01 impedes the model from assimilating the latest knowledge. Conversely, an excessively large value restricts the AT's ability to guide the student's learning, because the negative effect caused by incorrect pseudo-labels is amplified, degrading the performance of the whole model.
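As a hedged illustration, the overall objective can be read as a weighted sum of the supervised loss and the two unsupervised guidance terms; the symbols below are hypothetical stand-ins for the two weights studied here, not the paper's notation:

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{sup}} \;+\; \lambda_{\mathrm{AT}}\,\mathcal{L}_{\mathrm{AT}} \;+\; \lambda_{\mathrm{TG}}\,\mathcal{L}_{\mathrm{TG}},
\qquad \lambda_{\mathrm{AT}} = 0.05,\quad \lambda_{\mathrm{TG}} = 1.
$$

Under this reading, Table 6 sweeps the AT weight and Table 7 sweeps the TG weight.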
Table 7 displays the impact of the second parameter on model performance, with the first fixed at 0.05. The best performance occurs when its value is 1. The TG is designed to produce high-quality pseudo-labels. When the associated parameter is set too low, the model benefits little from these high-quality labels; conversely, if the parameter is too high, the model becomes overly reliant on the pseudo-labels generated by the TG and neglects the guiding information provided by the AT. This imbalance can lead to performance degradation.
Furthermore, we examined the significance of the threshold hyperparameter used to screen samples after D-S evidence fusion, as shown in Figure 8. An excessively large threshold yields a low recall rate, while an overly small threshold compromises pseudo-label quality. From the figure, it can be seen that the model performs best when the threshold is set to 0.6.
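A minimal sketch of the post-fusion screening this threshold performs (function name and data are hypothetical):

```python
def select_pseudo_labels(boxes, fused_belief, tau=0.6):
    """Keep candidate boxes whose fused ship belief exceeds tau.

    boxes: list of (x1, y1, x2, y2); fused_belief: belief in 'ship'
    after D-S evidence fusion, one value per box; tau: the threshold
    studied in Figure 8 (0.6 performed best).
    """
    return [b for b, m in zip(boxes, fused_belief) if m >= tau]

# A larger tau keeps only very reliable pseudo-labels (lower recall);
# a smaller tau admits noisy labels that mislead the student.
kept = select_pseudo_labels([(0, 0, 8, 8), (10, 10, 20, 18)], [0.87, 0.41])
print(kept)  # only the first box survives the 0.6 threshold
```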