1. Introduction
Object detection in remote sensing imagery [1,2,3,4,5] plays a significant role in identifying and precisely locating objects within an image. Its applications span environmental monitoring, military and national security, transportation, forestry, and oil and gas activity detection. Remote sensing images originate from varied platforms such as aircraft, satellites, and unmanned aerial vehicles. However, the complexities of remote sensing imagery, including intricate backgrounds, arbitrary object orientations, varying object densities, and large differences in object size ratios, pose substantial challenges for small object detection. Rotated bounding boxes, in contrast to conventional horizontal ones, markedly reduce background overlap and delineate object boundaries more accurately. Consequently, there is a growing need for research on rotated object detection in remote sensing imagery.
In remote sensing images, the same object can vary significantly in appearance depending on the background, resulting in notable intra-class variability. This is especially prominent in fine-grained remote sensing images, where distinctions between object classes are less apparent, so fully exploiting feature information becomes crucial for effective detection. Multi-level feature pyramid networks are commonly used to address the challenge of object scale variation in remote sensing images. In the Feature Pyramid Network (FPN) [19], higher-level feature maps carry richer semantic information but have smaller spatial resolution, making them less effective for detecting small objects, whereas lower-level feature maps have larger resolution but lack discriminative object representations. To bridge this gap, FPN incorporates a top-down pathway with lateral connections, allowing semantic information to flow from higher to lower levels and enabling object detection across scales. As a result, extensive research has been devoted to further enhancing FPN to better meet the demands of object detection in remote sensing images.
The DCFPN [25] employs densely connected multi-path dilated layers to cover objects of diverse sizes in remote sensing scenes, enabling dense and accurate extraction of multi-scale information and thereby boosting detection of variously sized objects. LFPN [26] accounts for both low-frequency and high-frequency features, using trainable Laplacian operators to extract high-frequency object features along Laplacian pathways, and introduces an attention mechanism within the feature pyramid network to emphasize more distinctive multi-scale object features. SPH-YOLOv5 [12] integrates an attention mechanism into FPN to capture semantic relationships among features, highlighting crucial spatial features and suppressing redundant ones. Info-FPN [23] introduces a PixelShuffle-based lateral connection module (PSM) to fully preserve channel information within the feature pyramid and, to mitigate confusion arising from feature misalignment, proposes a feature alignment module (FAM) that uses template matching and learns feature offsets during fusion. However, existing FPN-based methods often neglect the structural drawbacks of the feature pyramid, exploit original feature information inadequately, and suffer performance issues introduced by their attention mechanisms. These limitations reduce feature representation capacity, which is particularly noticeable for objects with large scale variations in remote sensing images.
In summary, the existing challenges are as follows: (1) Most SPPBottleneck modules lack the capability to capture both coarse spatial information and fine-grained feature details, constraining further detection of the targets of interest. (2) The mutual fusion among features at different layers within the feature pyramid is not thorough enough, leaving room for improvement in extracting and enhancing the fused feature information. In addition, most feature pyramid networks do not fully exploit the original features, even though these features play a crucial role in reinforcing feature fusion, enhancing residual functions, and ensuring stable gradient propagation during backpropagation.
In this paper, we propose robust solutions to overcome the mentioned challenges. Leveraging the RTMDet model as our baseline, we substitute the SPPBottleneck module with a Focused Feature Context Aggregation Module (FFCA Module). This module effectively captures coarse spatial information and fine-grained feature details at different scales, gathering intricate details across various target scales, thereby enhancing the model’s perception of the targets. Additionally, we design a multi-scale feature fusion feature pyramid to integrate spatial context within feature maps, maximizing the amalgamation of feature information across layers, and consequently enhancing the model’s representational capacity. These solutions seamlessly integrate into object detectors, enhancing detection performance without increasing training complexities. To summarize, our contributions are outlined below:
In the backbone network, we utilize a multi-level feature fusion mechanism to acquire features of different scales. Subsequently, context information is selectively extracted from local to global levels at varying granularities, resulting in feature maps equal in size to the input features. Finally, these feature maps are injected into the original features to obtain relevant information about the objects of interest without altering their size.
We design a feature aggregation module that assigns varying attention across multiple dimensions to the fused feature map information, thereby improving performance in capturing rich contextual information and consequently enhancing pixel-level attention towards objects of interest.
Within the feature pyramid, we efficiently harness original feature information to process multi-scale features more effectively by introducing a multi-scale fusion pyramid network. This network connects original features and fused features while shortening the information transmission paths, extending from large-scale features to fused small-scale features, and enabling the module to optimally utilize features at each stage.
We introduce a novel object detection network and conduct extensive experiments on three challenging datasets (MAR20, SRSDD, and HRSC), affirming the effectiveness of our approach. The experimental results demonstrate outstanding performance.
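To make the first contribution concrete, the following is a minimal PyTorch sketch of the underlying idea: context is gathered at several granularities, from global to local, and injected back into the original features without changing their size. The class and parameter names (e.g., ContextInjection, pool_sizes) are illustrative assumptions, not the exact FFCA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextInjection(nn.Module):
    """Illustrative sketch: gather context at several granularities and
    inject it into the original feature map without altering its size."""
    def __init__(self, channels, pool_sizes=(1, 3, 5)):
        super().__init__()
        # One 1x1 conv per context granularity (global -> local).
        self.context_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in pool_sizes)
        self.pool_sizes = pool_sizes

    def forward(self, x):
        _, _, h, w = x.shape
        out = x  # keep the original features as the residual path
        for size, conv in zip(self.pool_sizes, self.context_convs):
            ctx = F.adaptive_avg_pool2d(x, output_size=size)    # coarse context
            ctx = conv(ctx)                                     # refine context
            ctx = F.interpolate(ctx, size=(h, w), mode='nearest')
            out = out + ctx                                     # inject into original features
        return out

# Example: the output keeps the input's spatial size and channel count.
feat = torch.randn(2, 256, 32, 32)
print(ContextInjection(256)(feat).shape)  # torch.Size([2, 256, 32, 32])
```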
4. Experiments
This section assesses the effectiveness of the proposed model by training and testing it on three widely used datasets: MAR20, SRSDD, and HRSC. We present a comprehensive overview of our experiments, covering experimental design, parameter configurations, comparisons with state-of-the-art (SOTA) models, and experimental outcomes. Additionally, we conduct an ablation study on the MAR20 dataset to demonstrate the effectiveness of each module. Our software environment includes CUDA 11.8, Python 3.8.10, PyTorch 2.0, MMDetection 3.1.0, and MMRotate 1.x. The hardware setup includes an Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60 GHz, an NVIDIA GeForce RTX 3090 GPU, and 80 GB of memory. Configuration files follow the default settings of MMRotate, with a linear learning-rate schedule over the first 1000 iterations followed by a cosine decay schedule. All experiments are evaluated with DotaMetric. The AdamW optimizer is used with a base learning rate of 0.00025, a momentum of 0.9, and a weight decay of 0.05 for all experiments. Random seeds for both the numpy library and tensors are set to 42.
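As a point of reference, the settings above roughly correspond to an MMEngine/MMRotate-style configuration fragment like the sketch below. The scheduler boundaries, eta_min, and betas are assumptions following the default RTMDet configuration conventions rather than values stated in this paper.

```python
# Hedged sketch of the optimizer/schedule settings described above,
# written in the MMEngine config style used by MMRotate 1.x.
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=0.00025, betas=(0.9, 0.999),  # "momentum 0.9" -> beta1
                   weight_decay=0.05))

param_scheduler = [
    # Linear learning-rate schedule over the first 1000 iterations.
    dict(type='LinearLR', start_factor=1.0e-5, by_epoch=False, begin=0, end=1000),
    # Cosine decay afterwards (the epoch range here is an assumption).
    dict(type='CosineAnnealingLR', eta_min=0.00025 * 0.05,
         begin=18, end=36, T_max=18, by_epoch=True, convert_to_iter_based=True),
]

randomness = dict(seed=42)  # fixed random seed, as stated above
```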
4.1. Datasets and Evaluation Metrics
4.1.1. Datasets
The MAR20 [64] dataset is the largest publicly available dataset for recognizing military aircraft targets in remote sensing images. It includes 3842 images featuring 20 distinct military aircraft models, totaling 22341 instances, and most images have a resolution of 800×800 pixels. The instances were gathered via Google Earth imagery from 60 military airfields in countries including the United States and Russia. The dataset comprises 20 aircraft models: six Russian aircraft (the SU-35 fighter, TU-160 bomber, TU-22 bomber, TU-95 bomber, SU-34 fighter-bomber, and SU-24 fighter-bomber) and 14 United States aircraft (the C-130 transport plane, C-17 transport plane, C-5 transport plane, F-16 fighter, E-3 AWACS (Airborne Warning and Control System) aircraft, B-52 bomber, P-3C anti-submarine warfare aircraft, B-1B bomber, E-8 Joint Surveillance Target Attack Radar System (Joint STARS) aircraft, F-15 fighter, KC-135 aerial refueling aircraft, F-22 fighter, F/A-18 fighter-attack aircraft, and KC-10 aerial refueling aircraft). These aircraft model types are labeled A1 to A20. The training set contains 1331 images with 7870 instances, while the test set includes 2511 images with 14471 instances.
The SRSDD [73] dataset is a high-resolution synthetic aperture radar (SAR) dataset designed for ship detection, characterized by complex backgrounds and notable interference. The original SAR images were acquired in spotlight mode with HH and VV polarization. Annotations use rotated bounding boxes, making the dataset well suited for oriented object detection. It consists of 666 patches of 1024×1024 pixels cropped from 30 panoramic images of the Chinese high-resolution Gaofen-3 SAR satellite at 1 m resolution. The dataset includes 2884 ship instances distributed among six categories: Container, Dredger, Ore-oil, LawEnforce, Cell-Container, and Fishing, containing 89, 263, 166, 25, 2053, and 288 instances, respectively. Most images capture coastal areas with intricate background interference, which poses substantial challenges for detection.
HRSC [65] is a widely used benchmark for arbitrarily oriented object detection. It consists of 1061 images ranging in size from 300×300 to 1500×900 pixels. The training set comprises 436 images, the validation set 181 images, and the remaining images are used for testing. For evaluation, we report COCO-style mean average precision (mAP) along with average precision at IoU thresholds of 0.5 and 0.75 (AP50 and AP75) for HRSC.
4.1.2. Evaluation Metrics
Various metrics commonly used for remote sensing object detection (RSOD) are employed to evaluate the proposed model. Average precision (AP) serves as the main performance metric for the object detection models in this paper. Precision (p) is the ratio of correctly identified targets to all detected results, and recall (r) is the ratio of correctly identified targets to all ground-truth targets:

p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN},

where TP denotes correctly detected targets, FP denotes background regions detected as targets, and FN denotes targets misclassified as background. AP is the area under the precision-recall curve, with p on the vertical axis and r on the horizontal axis:

AP = \int_{0}^{1} p(r)\,dr.

AP considers both precision and recall, and a higher value indicates better detection accuracy. The mean average precision (mAP) over all classes is calculated as

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,

where N is the number of object categories. mAP@0.5 denotes the mean average precision over all classes at an Intersection over Union (IoU) threshold of 0.5, and mAP@0.5:0.95 denotes the mAP averaged over IoU thresholds from 0.5 to 0.95.
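For readers implementing the metric, the following is a small numpy sketch of how AP can be computed from a ranked list of detections by integrating the precision-recall curve. It follows a generic all-point interpolation scheme rather than the exact DotaMetric implementation, and the function and argument names are illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Illustrative AP computation: sort detections by confidence, build the
    precision-recall curve, and integrate it (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)                          # r = TP / (TP + FN)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)   # p = TP / (TP + FP)
    # Make precision monotonically non-increasing, then integrate p(r) dr.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall, [recall[-1]]))
    precision = np.concatenate(([precision[0]], precision, [0.0]))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# Toy usage: 4 detections against 3 ground-truth boxes of one class.
print(average_precision(scores=[0.9, 0.8, 0.7, 0.6],
                        is_tp=[1, 0, 1, 1], num_gt=3))  # ~0.83
```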
4.2. Implementation details
We perform experiments with RTMDet [18] within the MMRotate toolbox [56]. Our experiments adopt the configuration of RTMDet, employing CSPNetXtBlock as the backbone network and CSPNetXt-PAFPN as the neck. During training, we apply diverse data augmentation techniques such as random flipping, rotation, scale variation, and padding; scale variation augmentation is likewise applied during testing and inference. In the comparative experiments, we keep hyperparameter settings consistent to ensure a fair comparison with other SOTA methods.
The MAR20 dataset is divided into patches of 800×800 pixels with a 200-pixel overlap between contiguous patches. For the SRSDD and HRSC datasets, we resize the images to 1024×1024 and 800×800 pixels, respectively, during training, validation, and testing, using data augmentation without cropping. We use the training subset for training and the test subset for validation and inference. Training runs for 36 epochs on the MAR20 dataset, 144 epochs on the SRSDD dataset, and 108 epochs on the HRSC dataset to derive the inference model.
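As an illustration of the patch splitting described above, the sketch below crops an image into 800×800 tiles with a 200-pixel overlap (i.e., a 600-pixel stride). It is a generic sliding-window routine, not the exact MMRotate image-splitting tool, and the function name is hypothetical; images smaller than one patch would additionally need padding.

```python
import numpy as np

def split_into_patches(image, patch=800, overlap=200):
    """Generic sliding-window cropping: 800x800 tiles with 200-px overlap."""
    stride = patch - overlap  # 600-pixel step between neighbouring tiles
    h, w = image.shape[:2]
    patches = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            bottom = min(top + patch, h)
            right = min(left + patch, w)
            # Shift border windows back so tiles keep the full patch size
            # (assuming the image is at least patch-sized).
            top_, left_ = max(bottom - patch, 0), max(right - patch, 0)
            patches.append(((top_, left_), image[top_:bottom, left_:right]))
    return patches

# Toy usage on a 1000x1400 "image": tile origins and shapes.
tiles = split_into_patches(np.zeros((1000, 1400, 3), dtype=np.uint8))
print([(origin, tile.shape[:2]) for origin, tile in tiles])
```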
4.3. Comparisons with SOTA
We compare our proposed method with other SOTA approaches on the MAR20, SRSDD, and HRSC datasets. As the tables below indicate, our method outperforms these SOTA approaches.
4.3.1. Results on MAR20
MAR20 is a fine-grained dataset created specifically for detecting military aircraft, covering a broad spectrum of target sizes and comprising remote sensing images captured under diverse climatic, seasonal, and lighting conditions. Because the modules we designed combine convolutional and attention characteristics, our model efficiently extracts features and aggregates feature context, obtaining high-quality feature maps. This enables effective category recognition and precise learning of object bounding boxes, yielding significantly higher accuracy than the current SOTA. For visualization, we select object categories at different scales and scenes where objects are arranged densely or sparsely against different backgrounds. The detection results are illustrated in Figure 6, which shows that the proposed method accurately detects densely arranged objects.
Table 1 presents the performance for each object category. For categories such as A11, A13, and A14, there is considerable room for improvement because each has fewer than 200 training instances. Similarly, some small object categories (such as A15 and A20) are hard to detect accurately owing to their small size, with approximately 70% of instances smaller than 100 pixels. The strong visual similarity between the A13 and A15 aircraft classes further complicates accurate detection. The same conclusion can be drawn from the mAP analysis. Overall, our approach outperforms the compared methods in most categories and achieves an outstanding mAP of 85.96%.
In the MAR20 dataset, we selected two images from the test set to showcase feature heatmaps; the heatmaps of the baseline model and MFCANet at scales P3, P4, and P5 are visualized in Figure 7. The figure shows that the baseline CSPPAFPN model does not extract sufficient features for the targeted objects: the heatmap responses are relatively small, there are instances of misalignment, and certain objects go undetected (such as the plane in the third column). Conversely, our approach significantly enhances feature extraction, producing more prominent, clearer-shaped, and more accurately positioned responses. This demonstrates the strong feature acquisition capability of our method in target differentiation, background noise suppression, and feature extraction.
The following conclusions can be drawn from the experimental results: Compared to the baseline, our network effectively captures intricate features of smaller targets within complex backgrounds, enabling precise identification of fine-grained objects and mitigating classification errors. This illustrates that our network thoroughly considers feature and contextual information extraction, effectively eliminating background noise interference. During the feature fusion phase, the network enhances target features, enabling better discrimination of subtle differences within categories, consequently yielding superior results compared to the baseline. However, our network still encounters certain issues. For instance, in scenarios involving more ambiguous images with complex background noise, our model exhibits instances of missed detections and classification errors.
4.3.2. Results on SRSDD
SRSDD is a dataset for detecting rotated objects amid intricate backgrounds. The quantities of the different categories vary substantially, leading to a pronounced data imbalance, and the complex backgrounds contain considerable noise, posing significant challenges for detection. As Table 2 shows, most algorithms achieve relatively low detection results. Our model is compared against various state-of-the-art approaches on the SRSDD dataset and demonstrates superior performance, with a 10.28% improvement over the baseline. Specifically, our model achieves the best results in two categories: Ore-oil vessels, whose distinct features make them comparatively easy for all algorithms to detect, and Law-enforce vessels, which are scarce and usually poorly detected by most algorithms. The improvement in the latter category stems from our model's ability to capture the distinct features and contextual information of Law-enforce vessels despite their scarcity. Container vessels, which often overlap with onshore targets and resemble fishing vessels, remain difficult to detect amid high noise levels; addressing this challenge is a focus of our future work. Overall, our method performs well across most categories, achieving a notable overall mAP of 66.28%. Nonetheless, issues persist, such as missed detections when many vessels are in close proximity and classification errors for vessels with less distinct features.
Figure 8 showcases a selection of the detection results, highlighting the proposed method's ability to accurately detect objects within complex backgrounds despite the aforementioned challenges. It is evident from Figure 8 that our model detects targets more accurately than the baseline. Within the same image, MFCANet detects and correctly classifies nearshore vessels amid complex coastal backgrounds. This capability stems from the FFCA Module, which extracts rich contextual feature information, after which the Feature Context Information Enhancement Module fuses and enhances multi-scale features, significantly boosting the model's ability to focus on global information. At the same time, the figure shows that our network still exhibits occasional misclassifications and missed detections. Despite these limitations, our model surpasses the current state-of-the-art, and we aim to address these remaining issues by further refining the network.
4.3.3. Results on HRSC
The HRSC dataset contains vessels with high aspect ratios navigating in different directions, posing significant challenges for precise target localization. Our proposed model shows strong feature extraction capabilities, emphasizing global information within the feature maps and effectively identifying class-specific features, resulting in excellent performance. As shown in Table 3, our method achieves evaluation scores of 90.48% and 97.84% on the VOC2007 and VOC2012 metrics, respectively. Figure 9 displays the visual results of our method on the HRSC dataset. The images show that, compared with the baseline, our model identifies targets more accurately. For instance, in the second and third columns of the first row, the baseline incorrectly identifies an object as a vessel, whereas our model avoids this misidentification. Likewise, when an object is correctly identified, our model expresses higher confidence in the prediction. In the last row, where a vessel is not recognized, the miss may be due to image cropping that retains only a small portion of the vessel, hindering the model from extracting its features effectively.
4.4. Ablation Study
4.4.1. Ablation study with different feature fusion methods in MFFM
To analyze how the original features are enhanced when fused with the PAFPN features, we conduct an ablation experiment on the skip connections within the Multi-Feature Fusion Module (MFFM). Figure 3 marks the skip connections used in the ablation with different colors, identified as red and orange. We compare how the original features fuse with PAFPN against the baseline RTMDet on the MAR20 dataset. The experimental results in Figure 10 indicate that incorporating only the yellow skip connection yields a slight improvement. This is likely because the yellow skip connection mainly operates in the middle layer, which fuses the original features, while the other two layers simply replicate them. Better results are obtained when both multi-scale feature fusion methods are employed simultaneously, notably enhancing detection accuracy. This improvement can be attributed to the red skip connection re-fusing the original features with the already fused ones, compensating for previously overlooked features and thereby improving the overall outcome.
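The sketch below illustrates the kind of skip-connection fusion examined in this ablation: the original backbone features are re-injected into the already fused PAFPN features through a residual connection, with a 1×1 convolution for channel adjustment. It is a simplified assumption of the structure in Figure 3 (class and argument names are ours), not the exact MFFM code.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Simplified sketch: fuse an original backbone feature map with the
    corresponding PAFPN output via a residual skip connection."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 conv adjusts the channel dimension of the original feature.
        self.align = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, original_feat, pafpn_feat):
        # Skip connection: add the aligned original features back onto the
        # fused PAFPN features, then refine the sum.
        return self.fuse(pafpn_feat + self.align(original_feat))

# Toy usage with one pyramid level.
orig = torch.randn(1, 512, 40, 40)   # original backbone feature
fused = torch.randn(1, 256, 40, 40)  # PAFPN output at the same level
print(SkipFusion(512, 256)(orig, fused).shape)  # torch.Size([1, 256, 40, 40])
```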
4.4.2. Ablation study on ODCLayer modules
To better understand the functionality of our proposed ODCLayer module (Figure 4), we conduct an ablation experiment on its components. Specifically, we treat a 3×3 and a 5×5 ODConv kernel as one set and perform ablation experiments using three, four, and five such sets, both with and without channel attention. The results in Figure 11 show that employing four sets of ODConv together with channel attention achieves the best performance. The following observations can be drawn: when the set count is 3, the features are incompletely integrated, resulting in suboptimal aggregation of contextual feature information and consequently poorer results. When the set count is 5, the outcomes degrade compared with 4, as background and noise information is aggregated during feature context fusion. Because distinct channel weights affect the outcome differently, integrating channel attention mitigates the adverse effects of specific channel information. Consequently, adding channel attention to the four-set configuration further improves the results and yields the most favorable outcome.
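For orientation, the following PyTorch sketch mirrors the structure this ablation varies: N serial sets of 3×3 and 5×5 convolutions with residual connections on one split branch, concatenation with the other branch, and optional squeeze-and-excitation-style channel attention. Plain Conv2d blocks stand in for the real ODConv operators, and all names are illustrative rather than the exact ODCLayer implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c, k):
    # Placeholder for ODConv: a plain conv block with the same kernel size.
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2),
                         nn.BatchNorm2d(c), nn.SiLU(inplace=True))

class ODCLayerSketch(nn.Module):
    """Illustrative ODCLayer-like block: split, N serial 3x3+5x5 sets with
    residuals, concatenate the two branches, then channel attention."""
    def __init__(self, channels, num_sets=4, use_attention=True):
        super().__init__()
        half = channels // 2
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        self.sets = nn.ModuleList(
            nn.Sequential(conv_bn_act(half, 3), conv_bn_act(half, 5))
            for _ in range(num_sets))
        # Squeeze-and-excitation style channel attention (optional).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.SiLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        ) if use_attention else None

    def forward(self, x):
        x = self.reduce(x)
        a, b = x.chunk(2, dim=1)          # split into two segments
        for block in self.sets:
            a = a + block(a)              # serial 3x3 + 5x5 set with residual
        out = torch.cat([a, b], dim=1)    # merge the two segments
        if self.attn is not None:
            out = out * self.attn(out)    # re-weight channels
        return out

print(ODCLayerSketch(256)(torch.randn(1, 256, 32, 32)).shape)
```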
4.4.3. Ablation study on MFCANet
To assess the efficacy of each proposed module, we compare the individual enhancement modules against the baseline RTMDet detector on the MAR20 dataset. The assessment centers on the average precision (AP) and mean average precision (mAP) of representative object categories, namely A4, A5, A11, A13, A14, A15, A16, A18, and A20. Because fine-grained objects in remote sensing images are similar to one another and the backgrounds are complex under varying seasons and lighting conditions, their detection is challenging.
Meticulous ablation experiments are conducted on each enhancement module, and the results in Table 4 highlight the recognition outcomes for some particularly challenging targets. The experiments show the effectiveness of the FFCA Module in significantly boosting the backbone network's ability to extract features across scales. Meanwhile, the ODCLayer module, through its multidimensional attention mechanism and the broader receptive fields of its large-kernel convolutions, adeptly captures comprehensive contextual information; this reduces background interference while enhancing the nuances of target features, increasing the model's sensitivity to targets. Furthermore, the skip-connection network skillfully utilizes the original feature information, preventing information loss during the fusion process. The synergistic interaction among these three modules demonstrates the strong capability of our multi-scale feature context aggregation network.
Author Contributions
Conceptualization, Honghui Jiang and Tingting Luo; methodology, Honghui Jiang; software, Honghui Jiang; validation, Honghui Jiang, Tingting Luo and Guozheng Zhang; formal analysis, Honghui Jiang, Hu Peng and Guozheng Zhang; investigation, Honghui Jiang and Hu Peng; resources, Hu Peng; data curation, Honghui Jiang and Tingting Luo; writing—original draft preparation, Honghui Jiang and Tingting Luo; writing—review and editing, Honghui Jiang and Tingting Luo; visualization, Tingting Luo; supervision, Honghui Jiang; project administration, Honghui Jiang; funding acquisition, Honghui Jiang. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The fundamental macro-architecture comprises three segments: the backbone, neck, and heads. Input images are processed through the backbone network to extract features, resulting in three sets of feature maps at varying scales. The neck section utilizes PAFPN for the bidirectional merging of these multi-scale feature maps before passing them to the head. In the head component, predictions encompass various aspects, including object category counts, boundary regression, and detected target rotation angles, derived from the input features.
Figure 2.
The FFCA module is specifically designed to acquire multi-scale focal feature context information. C, H, and W respectively denote the channel, height, and width of the feature map. ’mean’ represents the tensor’s mean operation, ⊙ denotes tensor multiplication, and ⊕ signifies tensor addition.
Figure 3.
The Multiscale Feature Fusion Network integrates intermediate and final outputs from PAFPN with the original output features using a red solid line as a residual connection. The fusion of intermediate-level information with deep-layer information is denoted by a deep yellow dashed line, employing a 1×1 convolutional kernel for channel dimension adjustment. The Fusion module, inherent in the baseline, is used for merging the concatenated features.
Figure 4.
The ODCLayer first integrates the input features using a 1×1 convolutional kernel. The integrated features are then split into two segments. In one segment, ODConv modules with kernel sizes of 3 and 5 are connected in series while preserving residual connections, and this sequence is repeated four times. The result is then concatenated with the other segment along the channel dimension. Finally, a channel attention mechanism assigns different weights to the channels.
Figure 5.
The architectural components of MFCANet consist of crucial modules. Initially, we employ the FFCA module to replace the SPPFBottleneck in the backbone network, capturing multi-scale feature context information related to focal targets. Subsequently, utilizing the MFFM module enhances the utilization of original features, minimizing the loss of specific feature information during fusion processes. Finally, leveraging our designed ODCLayer maximizes the enhancement of cross-layer feature integration and extraction, considering information across various feature dimensions. Our improvements notably enhance the model’s detection capability within the context of remote sensing applications.
Figure 6.
The depicted image demonstrates the outcomes derived from our proposed approach on the MAR20 dataset, encompassing 20 distinct categories. The initial column portrays the dataset’s authentic annotations, while the second column displays the baseline results, and the third column exhibits our method’s outcomes. Each row corresponds to three sets of results for a single image. The rectangular boxes labeled A1 to A20 at the bottom signify the distinct colors representing respective category bounding boxes.
Figure 7.
Each image’s top row represents the output results of our method, while the second row showcases the baseline’s output results. The first column corresponds to the real image, and the subsequent columns, from the second to the fourth, display the output features from the P3, P4, and P5 levels of the pyramid. Blue denotes background, while red and yellow indicate highlighted responses of that specific feature part.
Figure 8.
We have presented a sequence of detection outcomes obtained by our proposed MFCANet on the SRSDD dataset. These outcomes emphasize MFCANet’s capability to accurately extract target features despite complex backgrounds near coastal and marine areas, ultimately yielding precise results. The initial column portrays the dataset’s authentic annotations, while the second column displays the baseline results, and the third column exhibits our method’s outcomes. Each row corresponds to three sets of results for a single image. The rectangular boxes at the bottom, each in a different color, represent the bounding box colors corresponding to different categories.
Figure 9.
We display a subset of detection outcomes achieved using our MFCANet on the HRSC dataset. The initial column depicts actual images, the second column exhibits predictions from the baseline model, and the third column illustrates predictions from our model. Our approach demonstrates outstanding performance by producing precise and high-quality detection outcomes, especially in identifying densely clustered ships with challenging high aspect ratios.
Figure 10.
The line chart below illustrates the Baseline, Orange, and Orange+Red, representing the baseline result, the inclusion of yellow skip connections, and the simultaneous inclusion of yellow and red skip connections, respectively. The vertical axis indicates the mAP for each method on the MAR20 dataset.
Figure 11.
From left to right, the bars in the chart represent configurations with three, four, and five sets of 3×3 and 5×5 ODConv kernels in the ODCLayer. The last bar corresponds to adding channel attention to the four-set configuration. The vertical axis represents the mAP of each method on the MAR20 dataset.
Table 1.
Detection Accuracy of Different Detection Methods on the MAR20 Dataset. The numerical value in black bold represents the maximum.
| Method | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [64] | 82.6 | 81.6 | 86.2 | 80.8 | 76.9 | 90.0 | 84.7 | 85.7 | 88.7 | 90.8 |
| Faster R-CNN [64] | 85.0 | 81.6 | 87.5 | 70.7 | 79.6 | 90.6 | 89.7 | 89.8 | 90.4 | 91.0 |
| Oriented R-CNN [64] | 86.1 | 81.7 | 88.1 | 69.6 | 75.6 | 89.9 | 90.5 | 89.5 | 89.8 | 90.9 |
| RoI Trans [64] | 85.4 | 81.5 | 87.6 | 78.3 | 80.5 | 90.5 | 90.2 | 87.6 | 87.9 | 90.9 |
| RTMDet [18] | 87.7 | 84.0 | 82.5 | 77.4 | 77.7 | 90.7 | 90.5 | 90.0 | 90.5 | 90.6 |
| Ours | 86.7 | 83.5 | 83.0 | 84.5 | 81.2 | 90.5 | 90.9 | 89.4 | 90.8 | 90.7 |

| Method | A11 | A12 | A13 | A14 | A15 | A16 | A17 | A18 | A19 | A20 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [64] | 81.7 | 86.1 | 69.6 | 82.3 | 47.7 | 88.1 | 90.2 | 62.0 | 83.6 | 79.8 | 81.1 |
| Faster R-CNN [64] | 85.5 | 88.1 | 63.4 | 88.3 | 42.4 | 88.9 | 90.5 | 62.2 | 78.3 | 77.7 | 81.4 |
| Oriented R-CNN [64] | 87.6 | 88.4 | 67.5 | 88.5 | 46.3 | 88.3 | 90.6 | 70.5 | 78.7 | 80.3 | 81.9 |
| RoI Trans [64] | 85.9 | 89.3 | 67.2 | 88.2 | 47.9 | 89.1 | 90.5 | 74.6 | 81.3 | 80.0 | 82.7 |
| RTMDet [18] | 84.5 | 87.7 | 69.2 | 86.9 | 71.7 | 85.7 | 90.5 | 82.9 | 81.5 | 74.4 | 83.83 |
| Ours | 85.7 | 88.3 | 78.1 | 88.9 | 76.1 | 88.2 | 90.4 | 88.5 | 83.8 | 79.8 | 85.96 |
Table 2.
Detection Accuracy of Different Detection Methods on the SRSDD Dataset. We utilize B1 to B6 to represent the six categories: Ore-oil, Fishing, Law-enforce, Dredger, Cell-Container, and Container. The numerical value in black bold represents the maximum.
| Method | B1 | B2 | B3 | B4 | B5 | B6 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R-RetinaNet [75] | 30.4 | 11.5 | 2.1 | 67.7 | 35.8 | 48.9 | 32.73 |
| [60] | 44.6 | 18.3 | 1.1 | 54.3 | 43.0 | 73.5 | 39.12 |
| BBAVectors [76] | 54.3 | 21.0 | 1.1 | 82.2 | 34.8 | 78.5 | 45.33 |
| R-FCOS [77] | 54.9 | 25.1 | 5.5 | 83.0 | 47.4 | 81.1 | 49.49 |
| Gliding Vertex [78] | 43.4 | 34.6 | 27.3 | 71.3 | 52.8 | 79.6 | 51.50 |
| FR-O [74] | 55.6 | 30.9 | 27.3 | 77.8 | 46.7 | 85.3 | 53.93 |
| ROI [59] | 61.4 | 32.9 | 27.3 | 79.4 | 48.9 | 76.4 | 54.38 |
| RTMDet (baseline) | 59.4 | 40.0 | 27.3 | 80.5 | 76.5 | 52.3 | 56.00 |
| RBFA-Net [79] | 59.4 | 41.5 | 73.5 | 77.2 | 57.4 | 71.6 | 63.42 |
| Ours | 66.2 | 31.4 | 94.8 | 81.8 | 73.0 | 50.5 | 66.28 |
Table 3.
Detection Accuracy of Different Detection Methods on the HRSC Dataset. The numerical value in black bold represents the maximum.
| Method | Backbone | mAP (07) (%) | mAP (12) (%) |
| --- | --- | --- | --- |
| [61] | R-101 | 90.17 | 95.01 |
| AOGC [68] | R-50 | 89.80 | 95.20 |
| MSSDet [62] | R-101 | 76.60 | 95.30 |
| [25] | R-101 | 89.97 | 95.57 |
| MSSDet [62] | R-152 | 77.30 | 95.80 |
| [60] | R-101 | 89.26 | 96.01 |
| DCFPN [25] | R-101 | 89.98 | 96.12 |
| RTMDet [18] | CSPNext-52 | 89.69 | 96.38 |
| Ours | CSPNext-52 | 90.48 | 97.84 |
Table 4.
The table clearly shows that adding each module independently enhances the detection performance of the baseline model. This suggests that our methods facilitate aggregating features and their contextual information within the baseline model at their respective positions. Moreover, the combination of any two modules exceeds the detection results achieved by a single module, illustrating the mutual enhancement among our method modules. Remarkably, integrating all three modules simultaneously significantly improves the detection results. Although certain individual module methods exhibit minor decreases in specific categories compared to the baseline, these variations stem from the diverse focal points of the respective module methods. Overall, the collective integration of our module methods produces a significant enhancement.
| Baseline | M1 | M2 | M3 | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | 87.7 | 84.0 | 82.5 | 77.4 | 77.7 | 90.7 | 90.5 | 90.0 | 90.5 | 90.6 |
| ✓ | ✓ | | | 85.4 | 80.5 | 85.4 | 81.0 | 82.7 | 90.8 | 90.8 | 90.1 | 90.5 | 90.8 |
| ✓ | | ✓ | | 87.1 | 81.2 | 83.2 | 84.5 | 80.0 | 90.5 | 89.8 | 87.1 | 90.6 | 90.9 |
| ✓ | | | ✓ | 87.5 | 87.7 | 85.9 | 83.0 | 81.1 | 90.8 | 90.8 | 90.1 | 90.6 | 90.9 |
| ✓ | ✓ | ✓ | | 84.6 | 85.3 | 88.9 | 85.9 | 79.2 | 90.7 | 90.5 | 87.6 | 89.2 | 90.9 |
| ✓ | | ✓ | ✓ | 88.7 | 84.7 | 84.3 | 85.1 | 81.5 | 90.6 | 90.1 | 90.4 | 90.6 | 90.8 |
| ✓ | ✓ | | ✓ | 88.3 | 85.0 | 89.9 | 87.3 | 83.1 | 90.8 | 90.5 | 89.4 | 90.7 | 90.9 |
| ✓ | ✓ | ✓ | ✓ | 86.7 | 83.5 | 83.0 | 84.5 | 81.2 | 90.5 | 90.9 | 89.4 | 90.8 | 90.7 |

| Baseline | M1 | M2 | M3 | A11 | A12 | A13 | A14 | A15 | A16 | A17 | A18 | A19 | A20 | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | 84.5 | 87.7 | 69.2 | 86.9 | 71.7 | 85.7 | 90.5 | 82.9 | 81.5 | 74.4 | 83.83 |
| ✓ | ✓ | | | 82.8 | 85.3 | 72.9 | 85.9 | 72.7 | 88.1 | 90.4 | 84.4 | 81.8 | 74.4 | 84.32 |
| ✓ | | ✓ | | 85.0 | 88.8 | 68.3 | 88.2 | 63.8 | 87.1 | 90.4 | 86.9 | 83.8 | 79.8 | 84.35 |
| ✓ | | | ✓ | 83.1 | 84.7 | 78.7 | 88.5 | 69.9 | 87.5 | 90.4 | 84.8 | 83.4 | 79.2 | 85.42 |
| ✓ | ✓ | ✓ | | 83.6 | 89.6 | 69.8 | 88.6 | 61.3 | 87.3 | 90.5 | 86.4 | 83.4 | 76.8 | 84.51 |
| ✓ | | ✓ | ✓ | 85.3 | 88.3 | 72.5 | 88.6 | 71.0 | 88.9 | 90.4 | 88.0 | 82.9 | 79.3 | 85.61 |
| ✓ | ✓ | | ✓ | 85.1 | 88.6 | 71.6 | 86.2 | 73.9 | 88.7 | 90.5 | 82.9 | 83.8 | 78.6 | 85.79 |
| ✓ | ✓ | ✓ | ✓ | 85.7 | 88.3 | 78.1 | 88.9 | 76.1 | 88.2 | 90.4 | 88.5 | 83.8 | 79.8 | 85.96 |