1. Introduction
Synthetic aperture radar (SAR) is an imaging radar with high range and azimuth resolution, and it is widely used in military and civilian fields due to its all-day, all-weather imaging capability. Target detection and segmentation are important parts of SAR image understanding and analysis. Since ships are the main maritime transport carriers and effective combat weapons, automatic ship detection and segmentation provide important support for protecting inviolable maritime rights and maintaining maritime military security. Therefore, research on ship detection and segmentation in SAR images is of great significance.
Most current ship detection methods [1,2,3] build on conventional object detection frameworks. These methods provide the position of a bounding box covering the target, but they do not provide detailed contour information. Target segmentation refers to segmenting the target of interest at the pixel level, which simultaneously provides both the position and the contour of the target. Hence, ship segmentation can be treated as a more accurate and comprehensive means of achieving ship detection.
Segmentation algorithms based on active contours are popular in the field of image segmentation, including the improved K-means active contour model [4,5], the entropy-based active contour segmentation model [6], and the Chan-Vese model [7]. The hidden Markov model is another commonly used method for image segmentation; it is a two-level structure consisting of an unobservable hidden layer and an observable upper layer. Clustering analysis is also widely used to solve this problem, for example the multi-center clustering algorithm [8], fast fuzzy segmentation [9], the adaptive fuzzy C-means algorithm [10], and the bias-correction fuzzy C-means algorithm [11]. As for SAR image segmentation, the most representative methods are the segmentation algorithms [12,13,14] based on the constant false alarm rate (CFAR) detector [15], in which a threshold is determined from the statistical characteristics of each image, and the image is segmented by comparing the gray-level value of each pixel against this threshold. CFAR-based methods consider pixel contrast information while ignoring the structural features of the target, which leads to speckle noise in the segmentation results, incorrect target localization, and a large number of false alarms.
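As a rough illustration of the thresholding idea behind such methods, the sketch below applies a single global threshold derived from image statistics. This is a toy stand-in, not the actual CFAR detectors of [12,13,14], and the scaling factor k is an illustrative choice:

```python
import numpy as np

def cfar_threshold_segment(image, k=3.0):
    """Toy CFAR-style segmentation: threshold each pixel against a value
    derived from the image's background statistics (mean + k * std).
    k loosely controls the false-alarm rate; it is illustrative only."""
    mu, sigma = image.mean(), image.std()
    threshold = mu + k * sigma
    # Pixels brighter than the threshold are declared target (ship) pixels
    return (image > threshold).astype(np.uint8)
```

A real CA-CFAR detector estimates the background statistics in a sliding window around each pixel (excluding guard cells), which is what lets it adapt to local clutter; the global variant above only conveys the compare-against-a-statistical-threshold principle.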
With the rapid development of deep learning, convolutional neural networks (CNNs) have achieved excellent performance in the field of image processing [16,17,18,19], including image classification [16], object detection [17], and object segmentation [18,19]. Many mature deep learning methods have also been put forward for SAR image processing. For example, Henry et al. [20] presented a fully convolutional neural network for road segmentation in SAR images and enhanced its sensitivity toward thin objects by adding spatial tolerance rules. Bianchi et al. [21] explored, for the first time, the capability of deep learning models to segment snow avalanches in SAR images at pixel granularity. Deep learning-based methods effectively improve performance in data-intensive tasks, where a large amount of data is required to train the deep models. However, their performance is limited, or they even fail, when the available training dataset is relatively small. Besides, deep learning-based methods show insufficient generalization ability in SAR image processing due to the large imaging areas and varied imaging characteristics of SAR images. Specifically, most of these methods perform well on source-domain data, but their performance degrades in the target domain. Therefore, solving the problem of SAR ship segmentation on small cross-domain datasets remains an extremely challenging task.
Transfer learning is a commonly used strategy for cross-domain tasks: knowledge learned from a source domain with sufficient training data is transferred to a target domain lacking training data. However, a certain amount of target-domain data is still required to achieve good results. This requirement remains a burden in SAR image processing because collecting SAR images and providing labels is expensive and time-consuming, especially for SAR segmentation, where pixel-level ground truth is needed. Few-shot learning (FSL) has the ability to learn and generalize from a small number (one or several) of samples, which provides a feasible solution to the above problem. As a typical few-shot learning framework, meta-learning is borrowed from the way humans learn a new task. Humans rarely learn from scratch; instead, when learning a new skill, they build on the experience gained from related tasks. Meta-learning, also known as learning to learn, is proposed based on this learning mechanism of the human brain. The purpose of meta-learning is to learn from previous learning tasks in a systematic, data-driven way to obtain a learning method or meta-knowledge, so as to accelerate the learning of new tasks [22]. Therefore, the meta-learning framework is applied to solve the problem of SAR ship segmentation on small cross-domain datasets.
The distribution of ship data differs considerably across regions due to various imaging modes, imaging resolutions, and imaging satellites. Ship segmentation in SAR images of different regions can therefore be regarded as tasks originating from different domains. In this paper, a multi-scale similarity guidance few-shot network, termed MSG-FN, is proposed for ship segmentation in heterogeneous SAR images with few labeled samples in the target domain. The proposed MSG-FN adopts a dual-branch structure consisting of a support branch and a query branch. The support branch extracts the features of a specific domain target with a single encoder, while the query branch utilizes a U-shaped encoder-decoder structure to segment the target in the query image. The two branches share parameters in the encoder part, and the encoder is composed of well-designed residual blocks combined with filter response normalization (FRN). A similarity guidance module is designed to guide the segmentation process of the query branch by incorporating the pixel-wise similarities between the features of support objects and query images. Four similarity guidance modules are deployed between the support branch and the query branch at various scales to enhance adaptability to targets of different scales. In addition, a challenging ship segmentation dataset named SARShip-4i, which includes both offshore and inshore ships, is built to evaluate the proposed network.
The key contributions of this paper are as follows.
A multi-scale similarity guidance few-shot learning framework with a dual-branch structure is proposed to implement ship segmentation in heterogeneous SAR images with few annotated samples;
A residual block combined with FRN is designed to improve generalization capability in the target domain; it forms the encoder of the support and query branches for domain-independent feature extraction;
A similarity guidance module is proposed and inserted between the two branches at various scales to perform hand-in-hand segmentation guidance of the query branch via pixel-wise similarity measurement;
A ship segmentation dataset named SARShip-4i is built, and experimental results on this dataset demonstrate that the proposed MSG-FN achieves superior ship segmentation performance.
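To make the FRN-based residual block concrete, the sketch below combines filter response normalization (with its thresholded linear unit) and a standard residual layout. The FRN computation follows the published FRN formulation; the exact composition of the paper's Res-FRN block is described in Section 3, so the block layout here is an assumption:

```python
import torch
import torch.nn as nn

class FRN(nn.Module):
    """Filter response normalization with a thresholded linear unit (TLU).
    Normalizes by the per-channel mean squared activation (no batch statistics)."""
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.tau = nn.Parameter(torch.zeros(1, channels, 1, 1))  # TLU threshold
        self.eps = eps

    def forward(self, x):
        # nu2: mean of squared activations over the spatial dimensions
        nu2 = x.pow(2).mean(dim=(2, 3), keepdim=True)
        x = x * torch.rsqrt(nu2 + self.eps)
        # TLU replaces ReLU: max(gamma * x + beta, tau)
        return torch.max(self.gamma * x + self.beta, self.tau)

class ResFRNBlock(nn.Module):
    """Residual block with FRN in place of batch normalization (layout assumed)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.frn1 = FRN(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.frn2 = FRN(out_ch)
        self.skip = (nn.Identity() if in_ch == out_ch and stride == 1
                     else nn.Conv2d(in_ch, out_ch, 1, stride, bias=False))

    def forward(self, x):
        out = self.frn1(self.conv1(x))
        out = self.frn2(self.conv2(out))
        return out + self.skip(x)
```

Because FRN avoids batch statistics entirely, it behaves identically at training and inference time and with small batch sizes, which is one plausible reason it helps cross-domain generalization.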
The remainder of this paper is organized as follows. Related work is briefly reviewed in Section 2. The proposed MSG-FN is presented in detail in Section 3. Experimental results and analysis are given in Section 4. Finally, conclusions are drawn in Section 5.
2. Related Work
2.1. Semantic Segmentation
Semantic segmentation is a classic problem in computer vision, which aims at pixel-level classification of images and provides a foundation for subsequent image scene understanding and environment perception tasks. The first deep learning approach applied to image semantic segmentation was patch classification [23], in which the image is cut into blocks that are fed into a deep model, and then the pixels are classified. Subsequently, the fully convolutional network (FCN) [24] was developed, which removes the fully connected layers and converts the network into a fully convolutional model. The FCN is much faster than patch classification and does not require a fixed input image size. However, the linear interpolation decoding in the FCN leads to a loss of structural information, and the obtained boundaries are relatively coarse despite the use of some skip connections. SegNet [25] was proposed to solve this problem by introducing more skip connections and reusing the max-pooling indices. Another issue of the FCN for semantic segmentation is the imbalance between the scale of the receptive field and the resolution of the feature map: the pooling layer enlarges the receptive field, but its down-sampling reduces the resolution, thus weakening the position information that semantic segmentation needs to preserve.
To keep the trade-off between the scale of the receptive field and the resolution of the feature map, dilated convolutional structures and encoder-decoder structures were proposed. Yu et al. [26] designed a dilated convolutional network for semantic segmentation, which increases the receptive field without decreasing the spatial dimension. U-Net [27] is a typical encoder-decoder structure: the encoder gradually reduces the spatial dimension through pooling layers, and the decoder recovers the details and spatial dimension of the target step by step. Besides, skip connections between the encoder and the decoder allow shallow features to assist in recovering the details of the target. Furthermore, RefineNet [28] was proposed based on U-Net; it exploits all the information available along the down-sampling process and uses long-range residual connections to enable high-resolution prediction. In this way, the fine-grained features from early convolutions refine the high-level semantic features captured by the deeper layers.
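The encoder-decoder-with-skip idea can be sketched as a two-level toy network. This is an illustration of the U-Net pattern, not the architecture of any of the cited works:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level U-Net: pooling halves the resolution in the encoder;
    the decoder upsamples and a skip connection restores spatial detail."""
    def __init__(self, in_ch=1, num_classes=2, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)  # base (skip) + base (upsampled)
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)              # full-resolution shallow features
        s2 = self.enc2(self.pool(s1))  # half-resolution deep features
        d1 = self.up(s2)               # decoder recovers the spatial dimension
        d1 = self.dec1(torch.cat([d1, s1], dim=1))  # skip connection
        return self.head(d1)
```

The concatenation in the decoder is exactly the mechanism by which shallow, location-rich features assist in recovering target boundaries.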
Accurate segmentation of targets at different scales is a focal and difficult issue of semantic segmentation. To achieve this goal, segmentation methods need to integrate spatial features of different scales to describe multi-scale objects accurately. A simple idea is the image pyramid [29], in which the input image is scaled to different sizes and the final segmentation result is obtained by integration. Beyond the image pyramid, most current methods focus on making effective use of both low-level and high-level features. It is believed that low-level features contain rich location information, which is particularly important for accurate positioning, while high-level features contain abundant semantic information, which benefits fine classification. In [30], a multi-scale context-aggregation module called the pyramid pooling module (PPM) was introduced, which uses pooling kernels of different sizes to capture global context information. On this basis, Chen et al. [31] proposed the atrous spatial pyramid pooling (ASPP) module by replacing the pooling and convolution in PPM with atrous convolutions. Subsequently, DenseASPP [32] was proposed to generate features covering a larger range of scales by combining the advantages of parallel and cascaded dilated convolutions.
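A minimal sketch of the ASPP idea follows; the dilation rates are the common DeepLab-style choice and are illustrative:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated 3x3 convolutions
    see the same feature map with different receptive fields."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding = dilation keeps the spatial size unchanged for 3x3 kernels
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Concatenate the multi-scale responses, then fuse with a 1x1 conv
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

Because dilation enlarges the receptive field without any down-sampling, ASPP aggregates multi-scale context while preserving feature-map resolution, which is precisely the trade-off discussed above.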
The above methods work well on large-scale natural image datasets, but their performance decreases when the amount of training data is small. As for SAR images, the number of images collected in a scene is limited due to the special imaging mode of SAR. Besides, the amount of labeled SAR imagery available to train a segmentation model is small because pixel-level labeling of SAR images is time-consuming and laborious. Therefore, how to use knowledge learned in other scenes to make predictions with few training data is an urgent problem worth considering.
2.2. Few-shot Learning
Few-shot learning is a learning paradigm proposed to address small-scale training data; it refers to learning from a limited number of samples with supervised information. Few-shot learning draws lessons from the rapid learning mechanism of the human brain, that is, human beings quickly learn new tasks by using what they have learned in the past. The amount of training data determines the upper limit of an algorithm's performance; if a small-scale dataset is used to train a complex deep neural network in the traditional way, over-fitting is inevitable. Due to its small demand for well-annotated training data, FSL has attracted wide attention and has been adopted in various image processing tasks, such as image classification [33,34,35], semantic segmentation [36,37,38], and object detection [39,40].
FSL aims at obtaining good learning performance given limited training samples. Specifically, a learning task is given together with a dataset consisting of a training set and a test set, where the number of training samples is relatively small, usually less than or equal to 5 per category. The training set is also called the support set, and the test set is also called the query set. Suppose there is a theoretical mapping function f* between the input x and the corresponding label y. The purpose of few-shot learning is to find an approximately optimal mapping function in the hypothesis space by learning from other similar tasks, so as to achieve accurate prediction on the test set. The "few-shot" nature is mainly reflected in the number of samples in the support set, i.e., the number of well-annotated samples required when learning a new task.
Take the classic task of image classification as an example. The training set contains data belonging to N different categories, where N is the number of categories contained in the training set and K is the number of images per category, so the number of training samples is N × K. This kind of few-shot learning is called N-way K-shot learning. In particular, it is called one-shot learning when K = 1.
2.3. Few-shot Semantic Segmentation
Currently, some researchers have tried to use few-shot learning to achieve image semantic segmentation. The most widely adopted technical route is to use the guidance information in the support set and guide the segmentation of the target in the query set through a carefully designed network structure. The generally adopted network is a dual-branch structure, as shown in Figure 1. The support image and its corresponding label are fed into the support branch to provide guidance for the query branch, which then produces the prediction for the query image. From the perspective of how guidance is achieved, existing few-shot segmentation methods can be divided into three types [36], namely, matching-based methods [37,38], prototype-based methods [41], and optimization-based methods [42].
A typical matching-based method is SG-One [37], which proposed a similarity-guided one-shot semantic segmentation network. SG-One uses dense pairwise features to measure similarity and a specific decoding network to generate segmentation results. On this basis, CANet [38] adds a multi-level feature comparison module to the dual-branch structure and improves segmentation performance through iterative optimization.
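As a rough sketch of this matching idea, the snippet below computes a masked-average support vector and a pixel-wise cosine similarity map over query features. It illustrates SG-One-style similarity guidance in general, not the exact implementation of SG-One or of the module proposed in this paper:

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Average support features over the annotated target region.
    feat: (C, H, W) support feature map; mask: (H, W) binary target mask."""
    mask = mask.unsqueeze(0).float()  # (1, H, W), broadcast over channels
    vec = (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-6)
    return vec  # (C,) support representation

def similarity_guidance(support_vec, query_feat):
    """Pixel-wise cosine similarity between the support vector and
    query features, producing a (H, W) guidance map."""
    q = F.normalize(query_feat, dim=0)                   # (C, H, W)
    s = F.normalize(support_vec, dim=0).view(-1, 1, 1)   # (C, 1, 1)
    return (q * s).sum(dim=0)                            # (H, W) in [-1, 1]
```

The resulting similarity map highlights query pixels whose features resemble the support target, and can be fused with query features to steer the decoder.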
Prototype-based methods extract global context information to represent each semantic category and use the overall prototype of the semantic category to match the query image at the pixel level. PANet [41] learns class-specific prototype representations by introducing prototype alignment regularization between the support branch and the query branch. Both prototype-based and matching-based methods use a metric-based meta-learning framework to compare the similarity between support and query images.
Optimization-based methods regard the few-shot semantic segmentation problem as a pixel classification problem; there are few related works. Among them, MetaSegNet [42] uses global and local feature extraction branches to extract meta-knowledge and integrates linear classifiers into the network to handle pixel classification. MetaSegNet mainly focuses on the N-way K-shot (N > 1) setting to address multi-class segmentation.
The above methods mainly focus on few-shot semantic segmentation of natural images. In the field of SAR image processing, it is not feasible to directly apply few-shot segmentation methods designed for natural images because of the large distribution differences among SAR images under different imaging conditions. Therefore, we remodel the few-shot segmentation approach for SAR images and propose a multi-scale similarity guidance network to achieve ship segmentation in heterogeneous SAR images with limited annotated data.
4. Experiment
4.1. SARShip-4i Dataset
This paper aims at ship segmentation in SAR images under the condition of few annotated samples in the target domain, so the proposed MSG-FN should be evaluated on a few-shot ship segmentation dataset consisting of SAR images. However, no SAR dataset is available so far for evaluating few-shot ship segmentation algorithms. Therefore, we built a SAR dataset named SARShip-4i, with reference to the COCO-20i [45] and Pascal-5i [46] datasets used for few-shot natural image segmentation, to evaluate the proposed MSG-FN method.
The SAR images in the SARShip-4i dataset consist of two parts: self-collected SAR images, whose segmentation labels are provided by our pixel-by-pixel manual annotation, and SAR images from the HRSID dataset [47], whose segmentation labels are generated from the segmentation polygons provided in HRSID. The SARShip-4i dataset contains 140 high-resolution SAR images from different imaging satellites and polarization modes, with resolutions ranging from 0.3 m to 5 m. Detailed information about the SAR images in the SARShip-4i dataset is given in Table 1.
The high-resolution SAR images are cropped into image patches and rescaled to the same size of 512×512, yielding a total of 6961 image patches in the SARShip-4i dataset. As mentioned in Section 3.1, SAR ship segmentation in different regions is treated as different segmentation tasks. The meta-training and meta-testing sets are drawn from SAR data of different regions, considering different imaging modes and regional factors, and there is no intersection between the regions covered by the meta-training set and those predicted in the meta-testing set. A cross-validation strategy is applied to evaluate the proposed MSG-FN: the SAR image patches are divided into 4 folds according to imaging region, as shown in Table 2. In each cross-validation round, the patches in one fold form the meta-testing set, and the patches in the other three folds form the meta-training set. To the best of our knowledge, SARShip-4i is the first dataset that can be used to evaluate few-shot ship segmentation methods on SAR images.
4.2. Implementation Details
In the setting of few-shot ship segmentation in SAR images, the training process is carried out in a meta-learning manner, and the fundamental unit for training and testing is the episode. Each episode is composed of a support set and a query set. Each support set consists of several image patches; for example, the support set contains 5 image patches in the 1-way 5-shot setting, and the query set contains one image patch in this paper. Before training and testing, the image patches in the dataset are organized into episode-based data. That is, an episode is generated by randomly selecting several image patches as a support-query pair, and it is necessary to ensure that there are no duplicate image patches between the support set and the query set within an episode.
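The episode construction described above can be sketched as follows; `build_episode` and `patch_ids` are illustrative names, not identifiers from the paper's code:

```python
import random

def build_episode(patch_ids, k_shot=5, seed=None):
    """Sample a 1-way K-shot episode: K support patches and 1 query patch,
    drawn without replacement so the support and query sets never overlap."""
    rng = random.Random(seed)
    chosen = rng.sample(patch_ids, k_shot + 1)  # k_shot + 1 distinct patches
    support, query = chosen[:k_shot], chosen[k_shot]
    return support, query
```

Sampling all k_shot + 1 patches in one draw without replacement is the simplest way to guarantee the no-duplicates constraint within an episode.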
The backbone of the proposed MSG-FN is the lightweight ResNet-18. Because of the large difference between SAR images and natural images, parameters pre-trained on large-scale natural image datasets such as ImageNet or COCO cannot be used to initialize our model, so it is trained from scratch. In the training phase, the network is optimized with stochastic gradient descent (SGD); the batch size is set to 3, and the momentum and weight decay are set to 0.9 and 0.0001, respectively. The learning rate increases linearly from 0 to 0.001 over the first 2000 steps and then decays exponentially, with a decay rate of 0.9, until the 300,000th step. The network is implemented in PyTorch, and all networks are trained and tested on NVIDIA GTX 1080 GPUs with 8 GB of memory.
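The described schedule (linear warmup, then exponential decay) can be sketched as below. The paper does not state the interval over which the 0.9 decay is applied, so the 10,000-step interval is an assumption for illustration:

```python
def learning_rate(step, base_lr=1e-3, warmup=2000, decay=0.9, decay_every=10_000):
    """Linear warmup from 0 to base_lr over `warmup` steps, then exponential
    decay. The decay interval is an illustrative assumption, not from the paper."""
    if step < warmup:
        return base_lr * step / warmup
    # Continuous exponential decay: multiply by `decay` every `decay_every` steps
    return base_lr * decay ** ((step - warmup) / decay_every)
```

In practice the same shape can be obtained with `torch.optim.lr_scheduler.LambdaLR` by passing this function (divided by `base_lr`) as the multiplier.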
4.3. Evaluation Metrics
There are four evaluation metrics used to evaluate the performance of the proposed MSG-FN, i.e., Precision, Recall, F1, and intersection over union (IoU). Precision and Recall are a pair of contradictory metrics, neither of which can fully measure segmentation performance. F1 is a more comprehensive criterion that maintains a trade-off between Precision and Recall. IoU measures the degree of overlap between the segmentation result and the ground truth. These four metrics are calculated as follows:
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1 = 2 × Precision × Recall / (Precision + Recall),
IoU = TP / (TP + FP + FN),
where C is the number of target categories to be segmented and is set to 1 here because the ship is the only target in this paper, and n_ij is the number of pixels inferred to belong to class i with a ground truth of class j. In other words, TP = n_11, FP = n_10, and FN = n_01 represent the numbers of true positives, false positives, and false negatives, respectively.
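These definitions can be computed directly from binary masks; the small epsilon guarding against empty masks is an implementation detail, not part of the paper's formulas:

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-9):
    """Precision, Recall, F1, and IoU for binary ship masks (arrays of 0/1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # predicted ship, ground-truth ship
    fp = np.logical_and(pred, ~gt).sum()   # predicted ship, ground-truth background
    fn = np.logical_and(~pred, gt).sum()   # missed ship pixels
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```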
4.4. Comparison with the State-of-the-arts
The proposed MSG-FN is evaluated against several state-of-the-art few-shot semantic segmentation methods under two experimental settings, namely 1-way 1-shot and 1-way 5-shot. 1-way 1-shot means that only one annotated support image is used to guide ship segmentation when making predictions on a query image of the unseen test data, and 1-way 5-shot refers to using five support images to guide the segmentation of the query image. In the 1-way 5-shot setting, the final segmentation result ŷ(i, j) is the average ensemble of the predicted masks generated under the guidance of the 5 support images, calculated as
ŷ(i, j) = (1/5) Σ_{k=1}^{5} ŷ_k(i, j),
where ŷ_k(i, j) is the predicted semantic label of the pixel at position (i, j) corresponding to the support image k.
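A minimal sketch of this average ensemble follows; it assumes the per-support predictions are equal-sized masks, and the 0.5 threshold for binarizing the averaged vote is an assumption, not stated in the paper:

```python
import numpy as np

def ensemble_masks(masks, threshold=0.5):
    """Average the K predicted masks obtained under each support image's
    guidance, then threshold the soft vote to get the final segmentation."""
    mean = np.mean(np.stack(masks, axis=0), axis=0)  # (H, W) pixel-wise average
    return (mean >= threshold).astype(np.uint8)
```

With binary per-support masks and a 0.5 threshold, this reduces to a pixel-wise majority vote across the K support images.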
No existing work is specifically designed for few-shot SAR ship segmentation, so we modify state-of-the-art few-shot semantic segmentation approaches for natural images to fit our settings for comparison. In the experiments, the training and testing settings specified in Section 4.1 are adopted. The results of the proposed MSG-FN and three comparison methods under the 1-way 1-shot and 1-way 5-shot settings are shown in Table 3 and Table 4, respectively.
The segmentation results on the four folds, as well as the mean results, are given in Table 3 and Table 4. We pay more attention to the mean results, which comprehensively evaluate the performance of the segmentation algorithms. The proposed method achieves the best precision, F1, and IoU under both the 1-way 1-shot and 1-way 5-shot settings. The recall of the proposed MSG-FN is slightly lower than that of PMMs [48]. This is because Precision and Recall are often in tension: improving one typically lowers the other, and neither can comprehensively measure segmentation performance compared with F1 and IoU. In particular, the F1 and IoU of our method are 74.35% and 63.60% in the 1-way 1-shot setting, and 74.37% and 63.62% in the 1-way 5-shot setting. The results in the 1-way 5-shot setting are better because more target-domain images are provided.
The segmentation results on several samples are presented in Figure 6 to visually illustrate the superiority of the proposed MSG-FN. The first three rows in Figure 6 are samples of ship segmentation in offshore scenes, and the last three rows are samples in inshore scenes. The first and second columns show the SAR image and the ground truth, and the third to sixth columns show the segmentation results of SG-One [37], PMMs [48], RPMMs [48], and the proposed MSG-FN, respectively. The segmentation results of the proposed MSG-FN are clearly more consistent with the ground truth than those of the other three methods. SG-One [37] misses some small-scale ship targets, as shown by the dashed yellow circles in the third and fifth rows. Meanwhile, SG-One produces many false alarms in complex inshore scenes, as shown by the dashed red circles in the fourth and sixth rows. The reason is that SG-One uses only a single-scale guidance module, so its applicability to ship targets of various scales is inferior to that of our method. The results of PMMs [48] and RPMMs [48] in the offshore scenes are similar to ours because the background there is relatively simple. For inshore ship segmentation, with its more complex and changeable background, many false alarms appear in the results of PMMs and RPMMs, as shown by the dashed red circles in the fourth and sixth rows. This is because PMMs and RPMMs use simple up-sampling interpolation in the decoder, while the proposed MSG-FN utilizes a U-shaped encoder-decoder structure. In conclusion, the experiments demonstrate that the segmentation results of the proposed MSG-FN are superior to the other state-of-the-art methods in terms of both quantitative metrics and qualitative visualization.
4.5. Analysis of the Learning Strategy
In this section, we analyze three learning strategies typically used to migrate models from a source domain to a target domain, in order to validate the effectiveness of the few-shot strategy used in the proposed MSG-FN. The comparison results are reported in Table 5. U-Net [27] and PSPNet [30] are two classic segmentation methods, for which the model trained on the source domain is directly used for inference on the target domain. U-Net (TL) and PSPNet (TL) apply a transfer learning strategy, using 40% of the target-domain data to fine-tune the model trained on the source domain. MSG-FN (1-shot) and MSG-FN (5-shot) are the proposed few-shot methods under the 1-way 1-shot and 1-way 5-shot settings.
As reported in Table 5, segmentation performance is poor when the trained model is directly used for prediction in the target domain. Performance improves with the transfer learning strategy because a large amount of target information is learned to narrow the gap between the source and target domains. Although transfer learning brings improvement, it requires a certain number of annotated target-domain samples for training, which is not feasible in practical applications. Notably, the proposed few-shot MSG-FN achieves the best performance. This is because MSG-FN acquires meta-information about each domain over a series of training episodes and utilizes it for prediction on unseen data. Besides, the amount of labeled target-domain data required is greatly reduced compared with the transfer learning methods. The experimental results verify that the few-shot learning strategy in the proposed MSG-FN effectively addresses semantic segmentation of SAR images with few labeled training data available in the target domain.
4.6. Ablation Study
In this section, ablation experiments are carried out to verify the effectiveness of the two main modules in the proposed MSG-FN. The fourth fold, SARShip-4³, defined in Table 2 is randomly selected to perform the ablation study under the 1-way 1-shot setting. The results of the ablation experiments are shown in Table 6. W/o Res-FRN represents a simplified version of MSG-FN in which the Res-FRN block is replaced by a plain residual block. W/o Multi-scale SGM represents another simplified version in which the multi-scale similarity guidance module is removed and only a single similarity guidance module is deployed at the end of the encoder.
Res-FRN block. As shown in Table 6, MSG-FN with the proposed Res-FRN block achieves significant improvements of 10.84% and 12.71% in terms of F1 and IoU over the W/o Res-FRN method, which indicates that the Res-FRN block extracts domain-independent features and generalizes better to target domains than the plain residual block.
Multi-scale SGM. In the proposed MSG-FN, the multi-scale similarity guidance module performs hand-in-hand guidance from support features to query features at various scales. Performance improves by 0.34% and 0.61% in terms of F1 and IoU compared to a single similarity guidance module, which illustrates the benefit of the hand-in-hand guidance, especially for segmentation targets of various scales.