Aiming to achieve efficient ship detection in SAR images, we design an efficient anchor-free detector, named ESarDet.
Figure 1 depicts the workflow of the proposed ESarDet. Due to the complicated computation and sample imbalance caused by anchor-based methods, we chose the latest lightweight, universal anchor-free object detection model, YOLOX-tiny [18], as the baseline model. Taking into account complex backgrounds, large scale variation, small-scale targets, and limited computational resources, three modules are proposed for the baseline model to optimize ship detection performance. First, to improve the detection of small ships in SAR images, we propose CAA-Net, which can effectively fuse context and semantic information. Second, to prevent the loss of semantic information of ship targets at the bottom layer and to improve detection efficiency, A2SPPF is designed to replace the SPP in YOLOX-tiny. Lastly, to better detect multi-scale ships, we propose a new convolution block, named A2CSPlayer, to better fuse feature maps of different scales. In the subsections that follow, the main components of ESarDet are described in detail.
3.1. CAA-Net
YOLOX-tiny applies a cross-stage partial network (CSPNet) as the backbone, which can enhance the network's learning capacity and reduce memory costs. However, due to the stacking of multiple small-kernel convolutions, the ERF of CSPNet is small, which makes it challenging to capture the context information of ships. In addition, the network extracts more semantic information but retains less context information as the number of network layers increases. To address these problems with CSPNet, a context attention auxiliary network (CAA-Net) is proposed to enhance the network's ability to extract context information and to expand the ERF.
Figure 2 depicts the working process of the proposed CAA-Net. The proposed CAA-Net consists of two paths that process the input feature map. Path A contains a large-kernel depthwise separable convolution (DSConv) block [25]. Path B contains three parts: a convolution block, a coordinate attention (CA) module [26], and another convolution block. Subsequently, the results of the two paths are concatenated and reshaped via convolution and SoftPool [27] to obtain the output of CAA-Net.
Most networks expand the receptive field by stacking convolutions with small kernel sizes. However, stacking small convolutions does not effectively increase the effective receptive field [28,29]. A large ERF can help the network better extract the context information of ships, especially small ones. The size of the ERF can be estimated as:

$$\mathrm{ERF} \approx g(m) \sim \mathcal{N}\left(0, \sigma^{2}\right), \qquad \sigma = \mathcal{O}\!\left(k\sqrt{n}\right),$$

where $\sigma$ is a standard deviation which indicates the size of the ERF, $\mathcal{N}(0, \sigma^{2})$ is roughly a Gaussian with zero mean and variance $\sigma^{2}$, and $g(m)$ denotes the Gaussian model of $m$. Moreover, $m$ represents the pixel point in the kernel, $k$ represents the kernel size, and $n$ denotes the number of convolution layers.
As the equation demonstrates, the ERF of a convolution is proportional to its kernel size and to the square root of the number of layers. It can be concluded that using a large-kernel convolution expands the ERF more effectively than increasing the depth of small convolution layers. The use of large-kernel convolution not only expands the effective receptive field but also enhances the network's ability to extract the context information of ships. Therefore, a large-kernel convolution is added to the proposed CAA-Net, expanding the ERF and improving the extraction of context information of small ships in SAR images.
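As a quick numerical illustration of the $k\sqrt{n}$ scaling, consider the following sketch (the layer counts and kernel sizes are hypothetical, chosen only to make the comparison concrete):

```python
import math

def erf_std(kernel_size: int, num_layers: int) -> float:
    """Approximate ERF size (standard deviation), which scales as O(k * sqrt(n))."""
    return kernel_size * math.sqrt(num_layers)

# A single hypothetical 21x21 large-kernel layer:
print(erf_std(21, 1))   # 21.0
# Nine stacked 3x3 layers only reach k * sqrt(n) = 3 * 3 = 9:
print(erf_std(3, 9))    # 9.0
# Matching the single 21x21 layer would take (21 / 3) ** 2 = 49 small layers:
print(erf_std(3, 49))   # 21.0
```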
Nevertheless, convolution with a large kernel size has low computational efficiency. In addition, large-kernel convolution makes it challenging to extract local features, which play a crucial role in ship detection. To mitigate these issues, we introduce DSConv to increase the computational efficiency and performance of large-kernel convolutions. The operation of DSConv is as follows:

$$\mathrm{DSConv}(X) = \mathrm{PWC}\big(\mathrm{Concat}\big(\mathrm{DWC}(X_{1}), \mathrm{DWC}(X_{2}), \ldots, \mathrm{DWC}(X_{C})\big)\big),$$

where $\mathrm{PWC}(\cdot)$ denotes the pointwise convolution operation, $\mathrm{DWC}(\cdot)$ denotes the depthwise convolution, $\mathrm{Concat}(\cdot)$ represents the concatenation of feature maps, and $X_{i}$ is the $i$-th channel of the input feature map $X$.
Different from conventional convolution, DSConv decouples the spatial information and cross-channel information of the input feature map. DSConv employs depthwise convolution (DWC) to process the input channel by channel and then concatenates the resulting feature maps, merging them into one output. However, using only DWC to process feature maps may cause a loss of cross-channel information. Thus, the pointwise convolution (PWC) is introduced, in which a 1×1 convolution copes with the cross-channel information. After the above process, a new feature map is generated. Compared to conventional convolution, DSConv significantly reduces the model's computational cost.
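A minimal PyTorch sketch of a DSConv block as described above (the 7×7 kernel and the channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: per-channel DWC followed by a 1x1 PWC."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise convolution: groups=in_ch processes each channel independently.
        self.dwc = nn.Conv2d(in_ch, in_ch, kernel_size,
                             padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise (1x1) convolution merges cross-channel information.
        self.pwc = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pwc(self.dwc(x))

x = torch.randn(1, 64, 40, 40)       # dummy SAR feature map
print(DSConv(64, 128)(x).shape)      # torch.Size([1, 128, 40, 40])
```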
To balance the context information extracted by the large-kernel convolution in path A, we add a shortcut path, path B, to CAA-Net. In path B, the input is first processed via a convolution block, which prevents overfitting and increases the generalization ability of the network. Additionally, this convolution block deepens the neural network and adds more nonlinearity, helping to extract more features. Moreover, we introduce the CA module, a lightweight attention module, to path B of CAA-Net to better balance the context information extracted in path A and to enhance the network's capacity to extract ship location information from SAR images.
Figure 3 depicts the working process of the CA module. In particular, the CA module contains two main steps, coordinate information embedding and coordinate attention generation, which can encode channel relations and long-range dependencies. The input $X$ is first compressed by X and Y global average pooling into feature maps of sizes $C \times H \times 1$ and $C \times 1 \times W$, respectively. After that, the two feature maps are concatenated. The concatenated result is reshaped to $C/r \times 1 \times (H+W)$ via a 1×1 convolution block ($r = 16$ in this paper). The reshaped result is subsequently split into two distinct feature maps, which are transformed into the attention weights $g^{h}$ and $g^{w}$ via two additional 1×1 convolutions and sigmoid functions. Finally, combining the output feature maps into weighting matrices, the input feature map $X$ is multiplied by the two weighting matrices to refine the weights. The CA module's operational flow can be summarized as follows:

$$z^{h}(h) = \frac{1}{W}\sum_{0 \le i < W} X(h, i), \qquad z^{w}(w) = \frac{1}{H}\sum_{0 \le j < H} X(j, w),$$

$$f = \mathrm{BN}\big(F_{1\times1}\big(\mathrm{Concat}(z^{h}, z^{w})\big)\big),$$

$$g^{h} = \sigma\big(F_{1\times1}(f^{h})\big), \qquad g^{w} = \sigma\big(F_{1\times1}(f^{w})\big), \qquad Y = X \times g^{h} \times g^{w},$$

where $W$ and $H$ are the width and height of the input feature map; $z^{h}$ and $z^{w}$ denote the results of X Avg Pool and Y Avg Pool, respectively; $f^{h}$ and $f^{w}$ are the two feature maps obtained by splitting $f$; $F_{1\times1}$ represents a convolution with kernel size 1×1; $\sigma$ denotes the sigmoid activation function; and $\mathrm{BN}$ represents the batch normalization operation.
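A minimal PyTorch sketch of the CA module following the equations above (the hidden-width rule and the ReLU activation follow the original CA paper and should be treated as assumptions here):

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of CA: directional pooling, shared 1x1 transform with BN,
    then per-direction attention weights (reduction r = 16 as in the paper)."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                      # C x H x 1 (X Avg Pool)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # C x W x 1 (Y Avg Pool)
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)               # split per direction
        g_h = torch.sigmoid(self.conv_h(f_h))                  # C x H x 1 weights
        g_w = torch.sigmoid(self.conv_w(f_w)).permute(0, 1, 3, 2)  # C x 1 x W weights
        return x * g_h * g_w

x = torch.randn(1, 64, 40, 40)
print(CoordinateAttention(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```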
The feature maps processed by paths A and B are concatenated into a new feature map, which is then reshaped via a convolution block.
To fuse with the feature map extracted from CSPNet, SoftPool is introduced to downsample the feature map, and its operation flow is depicted in Figure 4. Conventional pooling operations, such as max and average pooling, result in the loss of semantic information from the feature map, which affects the precision of SAR ship detection. Unlike conventional pooling operations, SoftPool downsamples the feature map using a softmax over each pooling region, producing normalized results that preserve more semantic information. The forward process of SoftPool can be summarized as follows:

$$w_{i} = \frac{e^{a_{i}}}{\sum_{j \in R} e^{a_{j}}}, \qquad \tilde{a} = \sum_{i \in R} w_{i}\, a_{i},$$

where $R$ denotes the kernel region of SoftPool, $e$ represents the natural exponent, $a_{i}$ denotes an activation of the input feature map within $R$, $w_{i}$ is the weight of $a_{i}$, and $\tilde{a}$ is the final output activation.
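The equations translate directly into a few lines of PyTorch; a minimal sketch (the 2×2 kernel and stride are illustrative):

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel: int = 2, stride: int = 2) -> torch.Tensor:
    """SoftPool: softmax-weighted sum of activations in each pooling region."""
    weights = torch.exp(x)
    # avg_pool of (w * x) over avg_pool of w equals sum(w * x) / sum(w) per region.
    num = F.avg_pool2d(weights * x, kernel, stride)
    den = F.avg_pool2d(weights, kernel, stride)
    return num / den

x = torch.randn(1, 64, 40, 40)
print(soft_pool2d(x).shape)  # torch.Size([1, 64, 20, 20])
```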
Finally, in order to fuse the context information extracted by CAA-Net with the semantic information extracted by the backbone, an efficient feature fusion module is proposed; its structure is depicted in Figure 5. The process of the proposed feature fusion module can be summarized as follows:

$$Y = \mathrm{CBAM}\big(\Phi\big(\mathrm{Concat}(F_{\mathrm{CAA}}, F_{\mathrm{backbone}})\big)\big),$$

where $\mathrm{CBAM}(\cdot)$ denotes the CBAM attention module and $\Phi(\cdot)$ represents a convolution block, which consists of a dilated convolution, batch normalization, and the SiLU activation function.
In the feature fusion module, the feature maps extracted via CAA-Net and the backbone are first concatenated. Then, the concatenated result is reshaped via a DSConv block. To better merge semantic information with context information, the convolutional block attention module (CBAM) [30], a mixed attention module, is subsequently applied to refine the feature map. The CBAM module's operating principle can be summarized as follows:

$$F' = M_{c}(F) \otimes F, \qquad M_{c}(F) = \sigma\big(\mathrm{MLP}(\mathrm{GAPool}(F)) + \mathrm{MLP}(\mathrm{GMPool}(F))\big),$$

$$F'' = M_{s}(F') \otimes F', \qquad M_{s}(F') = \sigma\big(f\big(\mathrm{Concat}(\mathrm{APool}(F'), \mathrm{MPool}(F'))\big)\big),$$

where $M_{s}(\cdot)$ and $M_{c}(\cdot)$ are the spatial attention and channel attention modules, $\mathrm{MLP}(\cdot)$ is a multilayer perceptron, and $f(\cdot)$ denotes a convolution layer. $\mathrm{GAPool}(\cdot)$ and $\mathrm{GMPool}(\cdot)$ denote global average pooling and global max pooling, respectively. Meanwhile, $\mathrm{APool}(\cdot)$ and $\mathrm{MPool}(\cdot)$ stand for average pooling and max pooling, respectively.
In CBAM, the input feature map is first processed by the channel attention submodule. In this submodule, two attention maps are obtained via GAPool and GMPool, respectively. The two attention maps are then refined independently via a shared two-layer multilayer perceptron (MLP) and merged by summation. The sigmoid activation function is applied to normalize the merged result. Finally, the input is multiplied with the attention map to obtain the result of the channel attention submodule. The spatial attention submodule then processes the refined feature map: it is first processed by APool and MPool along the channel dimension, after which the two resulting maps are concatenated and reshaped via a convolution. As in the channel attention submodule, a sigmoid activation function normalizes the attention map. The final result of CBAM is generated by multiplying the feature map with the attention map extracted by the spatial attention submodule.
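A minimal PyTorch sketch of CBAM following the equations above (the reduction ratio r = 16 and the 7×7 spatial kernel follow the original CBAM paper and are assumptions here):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of CBAM: channel attention (shared MLP over GAPool/GMPool)
    followed by spatial attention (conv over channel-wise APool/MPool)."""

    def __init__(self, channels: int, r: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(  # shared two-layer MLP as 1x1 convolutions
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: sigmoid(MLP(GAPool) + MLP(GMPool)).
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(1, 64, 40, 40)
print(CBAM(64)(x).shape)  # torch.Size([1, 64, 40, 40])
```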
3.2. A2SPPF
YOLOX-tiny introduces SPP [31] in the backbone to remove the fixed-size constraint of the network. As shown in Figure 6(a), in SPP the input feature map is processed in parallel via three max pooling operations. The three independently processed feature maps are concatenated with a residual feature map and then reshaped via a convolution block. However, the pooling operations in SPP result in the loss of the ships' semantic information in SAR images. In addition, the parallel processing of three feature maps leads to low computational efficiency. Moreover, SPP cannot effectively extract the information scattered across different channels.
Inspired by [32,33,34], we propose atrous attentive spatial pyramid pooling fast (A2SPPF), whose workflow is depicted in Figure 6(b). In comparison to SPP, the designed A2SPPF employs a serial structure to improve computational efficiency. Moreover, the proposed A2SPPF replaces the max pooling operations with dilated convolutions to expand the ERF and prevent the loss of detailed information in the feature map; the three dilated convolutions are assigned different dilation rates and kernel sizes. We also introduce the efficient channel attention (ECA) module [35], a lightweight attention module, to refine the channel weights. The structure diagram of the ECA module is depicted in Figure 7. The operating principle of ECA can be summarized as follows:

$$\omega = \sigma\big(\mathrm{C1D}_{k}(\mathrm{GAPool}(X))\big), \qquad Y = \omega \otimes X,$$

where $\mathrm{C1D}_{k}(\cdot)$ denotes a 1D convolution with kernel size $k$, and $\sigma$ represents the sigmoid activation function. The ECA module first obtains a $C \times 1 \times 1$ feature map via GAPool. A 1D convolution and a sigmoid function are subsequently applied to obtain the attention map. Lastly, the feature map is refined by multiplying the attention weights with the corresponding channels of the input.
The three feature maps processed via the ECA module are concatenated with the residual feature map. At the end of the proposed A2SPPF, the result is reshaped via a convolution block to obtain the final output.
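A minimal PyTorch sketch of the serial A2SPPF structure; since the exact dilation rates and kernel sizes are elided in the text above, the values in `cfg` are placeholders, and the ECA kernel size k = 3 is likewise an assumption:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """ECA: global average pooling, 1D conv across channels, sigmoid gate."""

    def __init__(self, k: int = 3):  # kernel size k is an assumed value
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3))                        # B x C (GAPool)
        w = self.conv(w.unsqueeze(1)).squeeze(1)      # 1D conv over channels
        return x * torch.sigmoid(w)[..., None, None]  # gate each channel

class A2SPPF(nn.Module):
    """Sketch of A2SPPF: serial dilated convs, each branch refined by ECA,
    then concatenation with the residual and a 1x1 fusion convolution."""

    def __init__(self, channels: int):
        super().__init__()
        cfg = [(3, 1), (3, 2), (3, 3)]  # placeholder (kernel, dilation) pairs
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=d * (k - 1) // 2,
                      dilation=d, bias=False)
            for k, d in cfg)
        self.eca = nn.ModuleList(ECA() for _ in cfg)
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs, y = [x], x
        for conv, eca in zip(self.branches, self.eca):
            y = eca(conv(y))       # serial: each branch feeds the next
            outs.append(y)
        return self.fuse(torch.cat(outs, dim=1))

x = torch.randn(1, 256, 20, 20)
print(A2SPPF(256)(x).shape)  # torch.Size([1, 256, 20, 20])
```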
3.3. A2CSPlayer
How to efficiently merge features of different scales extracted from the backbone is an important issue for detecting multi-scale ships in SAR images. YOLOX-tiny introduces PAFPN, in which the CSPlayer effectively merges feature maps from different scales. The CSPlayer increases the network's depth by stacking convolutions, and its bottleneck structure raises computational efficiency. However, the CSPlayer has a small ERF. In addition, it is challenging for the CSPlayer to effectively extract the features of small ships that are scattered across different channels. To achieve a more effective fusion of features from different scales, we propose the A2CSPlayer to process the concatenated feature maps. The architecture of the proposed A2CSPlayer is depicted in Figure 8.
The proposed A2CSPlayer contains two branches. In path A, a convolution block and an ECA module first refine the input feature map $X_{A}$ to extract the small-ship features scattered across multiple channels. The refined feature map is then split into two parts, $X_{1}$ and $X_{2}$. One part, $X_{1}$, is processed by two convolutions with different kernel sizes, and the result is added to the residual part $X_{2}$ to generate the final result of path A. The convolution operation of path A can be formulated as:

$$Y_{A} = F_{k_{2}}\big(F_{k_{1}}(X_{1})\big) + X_{2},$$

where $F_{k}(\cdot)$ denotes a convolution block with kernel size $k$.
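A minimal sketch of path A under stated assumptions: a 1×1 refine convolution, an even channel split, and hypothetical kernel sizes $k_{1} = 1$ and $k_{2} = 3$ for the two stacked convolutions (the paper's exact sizes are elided above); the ECA gate is reduced to its 1D convolution:

```python
import torch
import torch.nn as nn

class PathA(nn.Module):
    """Path A of A2CSPlayer: refine conv + ECA gate, channel split,
    two stacked convs on one half, residual add with the other half."""

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 1, bias=False)
        self.eca = nn.Conv1d(1, 1, 3, padding=1, bias=False)  # ECA's 1D conv
        half = channels // 2
        self.f1 = nn.Conv2d(half, half, 1, bias=False)             # assumed k1 = 1
        self.f2 = nn.Conv2d(half, half, 3, padding=1, bias=False)  # assumed k2 = 3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.refine(x)
        w = self.eca(x.mean(dim=(2, 3)).unsqueeze(1)).squeeze(1)
        x = x * torch.sigmoid(w)[..., None, None]   # channel-wise ECA gate
        x1, x2 = torch.chunk(x, 2, dim=1)           # split into two parts
        return self.f2(self.f1(x1)) + x2            # Y_A = F_k2(F_k1(X1)) + X2

x = torch.randn(1, 128, 40, 40)
print(PathA(128)(x).shape)  # torch.Size([1, 64, 40, 40])
```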
In path B, the input feature map $X_{B}$ is processed via the proposed dynamic dilated depthwise separable convolution (dynamic dilated DSConv), which has a larger ERF than a conventional convolution. The convolution operation can be expressed as:

$$Y_{B} = \mathrm{DDSConv}_{d}(X_{B}),$$

where $\mathrm{DDSConv}_{d}(\cdot)$ denotes the proposed dynamic dilated DSConv with dilation rate $d$.
The designed dynamic dilated DSConv is shown in Figure 9. To expand the ERF and raise computational efficiency, we first combine DSConv with dilated convolution, substituting the PWC operation in DSConv with a dilated convolution. However, feature maps of different scales have different widths and heights, and the context information they contain varies in scale. To improve the extraction of context information, we establish a mapping relationship between the dilation rate $d$ of the dynamic dilated DSConv, the width $W_{in}$ and height $H_{in}$ of the input image, and the width $W_{f}$ and height $H_{f}$ of the current feature map.
The proposed mapping relationship should meet both of the following conditions: (1) the dilation rate $d$ increases proportionally with the size of the feature map; and (2) to prevent the loss of long-range information due to an excessively large dilation rate, the dilation rate $d$ of the proposed dynamic dilated DSConv should be constrained. The proposed mapping relationship is shown below:

$$d = \left\lfloor \frac{b}{k} \times \sqrt{W_{f} \times H_{f}} \right\rfloor,$$

where $\lfloor \cdot \rfloor$ is the floor operation, and $W_{f}$ and $H_{f}$ are determined by $W_{in}$, $H_{in}$, and the downsampling stride of the current layer. In this paper, to meet the two conditions above, $k$ and $b$ are set to 3 and 0.2, respectively.
Table 1 shows the dilation rates calculated for different input image sizes and feature map sizes.
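A sketch of the dynamic dilated DSConv combining the mapping with the DSConv structure; the mapping follows the reconstructed formula above (k = 3, b = 0.2 as stated), and the 3×3 kernels are assumptions:

```python
import math
import torch
import torch.nn as nn

def dilation_rate(w_f: int, h_f: int, k: int = 3, b: float = 0.2) -> int:
    """Map the current feature-map size to a bounded dilation rate."""
    return max(1, math.floor(b / k * math.sqrt(w_f * h_f)))

class DynamicDilatedDSConv(nn.Module):
    """DSConv whose pointwise stage is replaced by a dilated convolution,
    with the dilation rate chosen from the feature-map size."""

    def __init__(self, channels: int, w_f: int, h_f: int, k: int = 3):
        super().__init__()
        d = dilation_rate(w_f, h_f, k)
        # Depthwise stage: per-channel spatial filtering.
        self.dwc = nn.Conv2d(channels, channels, k, padding=k // 2,
                             groups=channels, bias=False)
        # Dilated stage standing in for PWC, expanding the ERF.
        self.dilated = nn.Conv2d(channels, channels, k,
                                 padding=d * (k - 1) // 2, dilation=d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dilated(self.dwc(x))

for size in (80, 40, 20):  # PAFPN scales for a hypothetical 640x640 input
    print(size, dilation_rate(size, size))  # 80 -> 5, 40 -> 2, 20 -> 1
x = torch.randn(1, 128, 40, 40)
print(DynamicDilatedDSConv(128, 40, 40)(x).shape)  # torch.Size([1, 128, 40, 40])
```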
Following paths A and B, the two resulting feature maps are first concatenated. Finally, a convolution block processes the concatenated feature map to obtain the output of A2CSPlayer. The operation is as follows:

$$Y = \Phi\big(\mathrm{Concat}(Y_{A}, Y_{B})\big),$$

where $\Phi(\cdot)$ denotes the convolution block, and $Y_{A}$ and $Y_{B}$ are the outputs of paths A and B, respectively.