1. Introduction
Extracting water regions from radar images has been an important field in remote sensing, given that water segmentation has many real-world applications such as ice/snow/ocean monitoring and change detection (CD) of water in dams, lakes, and rivers. It is important to map water regions promptly and accurately and to provide information about them, especially when a natural disaster like flooding occurs. Nowadays, Synthetic Aperture Radar (SAR) data are widely used to provide imagery for water mapping, as radar technology offers weather-independent, 24/7-available, high-resolution satellite imagery.
Multi-temporal SAR imagery, where images of a region are acquired on several different dates [1], allows CD to be used as a methodology for flood detection; it uses not only flood imagery but also pre-/post-flood imagery as a reference for flood mapping [2,3]. The behavior of surface water, which shows high temporal variability (the standard deviation of the backscattered intensities) and low minimum backscatter in time series, enables us to use a simple thresholding approach for segmenting water accurately [4]. It is equally true, however, that multi-temporal imagery is not always available; we need at least two images (ideally produced under the same conditions except for the date) for water mapping. Single-date SAR imagery's larger uncertainty leaves it behind multi-temporal imagery in accurate extraction, but it is more economical in resources and time; it therefore has an advantage in situations which require fast action.
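The temporal-statistics idea above can be made concrete with a short sketch. The following is an illustrative NumPy implementation, not the specific method of [4]; the threshold values (`std_min`, `min_db`) are hypothetical placeholders.

```python
import numpy as np

def temporal_water_mask(stack_db, std_min=3.0, min_db=-18.0):
    """Flag pixels as water from time-series statistics of backscatter (dB).

    stack_db: array of shape (dates, height, width).
    Surface water tends to show high temporal variability (std over time)
    and a low minimum backscatter; both thresholds here are hypothetical.
    """
    temporal_std = stack_db.std(axis=0)   # per-pixel variability over dates
    temporal_min = stack_db.min(axis=0)   # per-pixel minimum backscatter
    return (temporal_std > std_min) & (temporal_min < min_db)
```

Note that this requires at least two acquisition dates, which is exactly the availability constraint discussed above.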
In general, electromagnetic (plane) waves have a degree of freedom, e.g. a rotation around the propagation axis. This allows linear polarization that is either vertical (V) or horizontal (H). Considering both transmitted and received waves, we have combinations of V and H such as the vertical-horizontal (vertical transmit and horizontal receive, VH) mode; likewise we have VV, HV, and HH. The HH mode is thought to be more suitable for distinguishing flood regions from non-flood regions [5,6]. The VV mode is more sensitive than the HH and cross-polarized (VH, HV) modes to small-scale roughness of waves on the water surface. The VH component, which makes it easier to notice differences between flood and non-flood regions, is more useful than the VV for detecting flood regions [7,8].
As to algorithms for land/water segmentation (rivers, lakes, flood regions, etc.) in SAR imagery, thresholding-based approaches have traditionally been used [5]. Their underlying assumption is that the backscattering amplitudes, which differ between the Earth's surface types, allow us to classify points in the imagery by setting thresholds in the amplitude histogram. For instance, the surface-layer backscatter coefficients of soil with high moisture tend to be higher than those of soil with low moisture [9]; the amplitude of a smooth surface such as a calm water region is low, while a rough surface yields a higher amplitude. As a result, water regions are generally darker in SAR imagery [1], and thus a threshold for land regions can be set. At the same time, the thresholds depend on several conditions, namely wind, incident angle, polarization, and speckle noise, and segmentation results are therefore affected by these conditions.
Various non-neural-network machine learning (ML) approaches have been used for several data sources: random forest for Sentinel-1 C-band [10], K-means clustering for Sentinel-1A [11], Fuzzy C-Means clustering for Gaofen-3 and Sentinel-1 [12], Support Vector Machine (SVM) [13], Markov Random Field (MRF) for TerraSAR-X [14], and XGBoost for Resourcesat-2 [15]. In addition, several statistical models have been used for Radarsat-2 [16], COSMO-SkyMed [17], and Sentinel-1 [18].
Superpixel-based segmentation approaches have also been applied to water mapping. In the superpixel framework, an image consisting of standard pixels is partitioned into groups by an algorithm, and each group carries statistically or morphologically meaningful information. In terms of image description, it could be argued that the superpixel representation maintains the same level of detail as the pixel representation, while the latter requires larger computational resources [19]. The framework therefore has the potential to organize image data faster and to focus on more relevant regions.
Superpixel-based approaches have a range of applications: change detection [20], image segmentation [21], automatic target recognition (ATR), extraction of water regions and coastlines [22], etc. For segmentation within the superpixel framework, a variety of superpixel methods has been developed, especially for optical images: simple linear iterative clustering (SLIC), edge-based SLIC (ESLIC) [23], pixel intensity and location similarity (PILS) [24], quick shift (QS) [25], and turbopixels (TP) [26,27]. When it comes to SAR imagery, the mixture-based superpixel (MISP) method is robust to SAR imagery's speckle noise, which facilitates more practical use of the model [28].
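To make the superpixel idea concrete, the following is a deliberately simplified, SLIC-flavored sketch that clusters pixels on (intensity, row, column) features with plain k-means. Real SLIC uses grid-initialized centers and limited search windows, and MISP uses mixture models; none of that is reproduced here, and the `spatial_weight` trade-off is a stand-in for SLIC's compactness parameter.

```python
import numpy as np

def toy_superpixels(img, n_segments=4, spatial_weight=0.5, n_iter=10, seed=0):
    """Group pixels into superpixel-like clusters via k-means on
    (intensity, normalized row, normalized column) features."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    feats = np.stack(
        [img.ravel(),
         spatial_weight * yy.ravel() / h,
         spatial_weight * xx.ravel() / w],
        axis=1,
    )
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=n_segments, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each pixel to the nearest cluster center in feature space.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Update centers; keep the old center if a cluster empties.
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(axis=0)
    return labels.reshape(h, w)
```

Each resulting label group can then be summarized by region statistics (mean backscatter, shape moments, etc.), which is exactly the "meaningful information per group" the framework promises.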
Not only non-neural-network ML models but also neural-network-based models have been applied to water mapping. CNN-based models, especially U-Net models, have shown good performance on typical evaluation metrics [29]; besides, they have been applied to near-real-time flood detection because of their speed [8].
In relation to encoder-decoder-type architectures, the evolution of Seq2Seq [30] into the Transformer was driven by the alignment model [31] (commonly known as the attention mechanism [32,33]). Compared with CNNs and RNNs, the Transformer has lower per-layer complexity, the minimum number of sequential operations, and the shortest maximum path length between positions [33]. Attention-based models, which make it easier to capture distant relations between words in sentences, have shown brilliant performance in Natural Language Processing (NLP) and accordingly have been widely used in the field. They are used not just in NLP but also in Computer Vision. Vision Transformer (ViT) converts an image into a sequence of patches which are used for embeddings, together with positional encoding, so that it can leverage contextual information about the image [34].
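The core operation referenced above can be written compactly. The sketch below follows the standard scaled dot-product attention formulation of [33] in NumPy, with single-head, unbatched toy shapes for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Because every query attends to every key in one step, distant tokens (or, in ViT, distant image patches) interact directly, which is the "shortest maximum path length" property noted above.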
The attention mechanism has also been used to improve a neural-network model's human interpretability. For instance, real driving datasets were analyzed by constructing a visual attention model and computing attention maps to find which parts of the video image the model attended to [35]. For chest radiograph classification, an attention-based model outperformed GradCAM (gradient-weighted class activation mapping); it was concluded that, compared to GradCAM, the attention-based model could be more helpful for radiologists in terms of decision making [36]. Also, audio data were classified by an attention-based deep sequence model, where the attention scores were leveraged to visualize which parts of the spectrogram were attended to [37].
The objective of the present study is to propose novel segmentation schemes whereby contextual information about superpixels is utilized as the input of machine learning to perform land/water segmentation of SAR imagery. This study also assesses our models' performance in land/water classification, applying them to Sentinel-1 flood data [38].
The rest of this paper is organized as follows. Section 2 describes the satellite data and the data processing for the present study, and presents the method for generating and annotating superpixels; the section also provides the schemes designed for unstructured/structured input for ML, as well as the performance indicators for land/water segmentation. Section 3 describes the parameter settings and the feature selection for the structured scheme, presenting the results of the numerical experiments; the section also makes a qualitative analysis using attention rollout scores. Section 4 interprets the findings from Section 3. The final section is devoted to our concluding remarks.
4. Discussion
We have seen that the neighborhood model generally outperforms the single model in both VV and VH modes for the MISP-SDT/XGB schemes. In the VV mode, the neighborhood model had better scores than the single model except that the recall values (0.459 in the unstructured scheme and 0.533 in the structured) for the single models were higher than those (0.425 in the unstructured and 0.514 in the structured, respectively) for the neighborhood models. These differences in recall between the single model and the neighborhood were mitigated in the VH mode: the recall values were 0.542 and 0.580 in the single models, compared to those (0.543 and 0.570) in the neighborhood models. That is, they are comparable for the unstructured scheme, and the gap was narrowed for the structured scheme in the VH mode.
As to polarization, it was found that the VV models underperformed the VH models for both MISP-SDT and MISP-XGB, regardless of whether the model is single or neighborhood. These findings are in line with those of previous studies, supporting the notion that cross-polarized SAR imagery is more suitable for flood detection than co-polarized imagery. It was also observed that, under MISP-XGB, the neighborhood model ranked 6 neighbor features among the top 10 features in the VV mode and 9 neighbor features among the top 10 in the VH mode. It could be argued that the VH mode made it easier for the algorithm to utilize neighbor features for the classification, compared to the VV mode.
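The top-10 tally above can be reproduced mechanically from a feature-importance table. The helper below is an illustrative sketch; the `neighbor_` name prefix and the example importance values are hypothetical, not the actual feature names or scores of our MISP-XGB models.

```python
def neighbor_features_in_top_k(importance, k=10, neighbor_prefix="neighbor_"):
    """Count how many neighbor-derived features rank in the top k by importance.

    `importance` maps feature name -> importance score (e.g. an XGBoost
    gain-style score); the `neighbor_` prefix is a hypothetical convention
    for features computed from neighbor superpixels.
    """
    top_k = sorted(importance, key=importance.get, reverse=True)[:k]
    return sum(name.startswith(neighbor_prefix) for name in top_k)
```

With such a tally, the VV/VH comparison reduces to comparing two integers (6 vs. 9 in our experiments).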
We have carried out a qualitative analysis as well; we now introduce the concept of the Near Neighbor region to evaluate its results more carefully. Roughly, the Near Neighbor region of a target superpixel is the vicinity of that superpixel. More specifically, we consider the circle whose center is the centroid of the target superpixel's region and whose radius is 1.1 times the semi-major axis of that region. We then define the Near Neighbor region of the superpixel as the intersection of the inside of this circle and the outside of the target superpixel.
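This geometric construction can be sketched as follows. The sketch estimates the semi-major axis from the region's second central moments (the convention used by common regionprops implementations); that estimation choice is an assumption for illustration, not necessarily the exact definition used in our pipeline.

```python
import numpy as np

def near_neighbor_mask(target_mask, scale=1.1):
    """Near Neighbor region: pixels inside the circle of radius
    scale * (semi-major axis), centred at the target's centroid,
    excluding the target superpixel itself.

    target_mask: boolean array marking the target superpixel.
    """
    ys, xs = np.nonzero(target_mask)
    cy, cx = ys.mean(), xs.mean()
    # Semi-major axis estimated from the region's second central moments.
    cov = np.cov(np.stack([ys - cy, xs - cx]))
    semi_major = 2.0 * np.sqrt(np.linalg.eigvalsh(cov)[-1])
    h, w = target_mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    inside_circle = (yy - cy) ** 2 + (xx - cx) ** 2 <= (scale * semi_major) ** 2
    return inside_circle & ~target_mask
```

The Far Neighbor region is then simply the remaining neighbor area outside this circle.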
Figure 7 illustrates a few examples of part of the SAR imagery (left subfigures), its target/neighbor superpixels (center), and the corresponding Target/Near Neighbor/Far Neighbor regions (right). In the right subfigures, the Target, Near Neighbor, and Far Neighbor regions are filled with white, indigo, and gray, respectively.
Then, we calculated the average of the attention rollout scores for each region (i.e. Padding/Far Neighbor/Near Neighbor/Target), using the single, mask, and neighborhood models on the test VH data.
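The computation follows the common attention-rollout recipe: residual-augmented, row-normalized attention matrices are multiplied through the layers, and the resulting token scores are min-max normalized and averaged per region. The sketch below illustrates this; the exact implementation details (head averaging, normalization order) are assumptions for illustration.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Roll attention up through the layers.

    attn_layers: list of (tokens, tokens) attention matrices,
    already averaged over heads.
    """
    n = attn_layers[0].shape[0]
    rollout = np.eye(n)
    for attn in attn_layers:
        aug = 0.5 * attn + 0.5 * np.eye(n)      # account for residual connections
        aug = aug / aug.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = aug @ rollout
    return rollout

def region_mean_score(scores, region_mask):
    """Min-max normalize token scores to [0, 1], then average over a region."""
    s = (scores - scores.min()) / (scores.max() - scores.min())
    return s[region_mask].mean()
```

Applying `region_mean_score` with the Padding/Far Neighbor/Near Neighbor/Target masks yields the per-region averages summarized in Figure 8.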
Figure 8 shows boxplots of the averages of attention rollout scores for the Pad (1st subfigure), Far Neighbor (2nd), Near Neighbor (3rd), and Target (4th) regions, where the attention rollout scores were min-max normalized to a scale of 0 to 1. Each subfigure shows the results from the single (red), mask (green), and neighborhood (blue) models. It is apparent that, for the neighborhood model, the attention rollout scores in the Target, Near Neighbor, and Far Neighbor regions are the highest, second highest, and third highest, respectively. This pattern also emerges in the single and mask models.
It is noticeable that the score for the Target region was the highest among the four regions in the neighborhood model, as it was in the mask and single models. Comparing the three models, the neighborhood, single, and mask models had the highest, second highest, and third highest scores, respectively, for the Target region, and this pattern is similar for the Far Neighbor and Near Neighbor regions. It could be argued that these findings are related to the fact that the neighborhood model is given more information than the single and mask models, and that the mask model is not given information about the role of a superpixel.
It is thought that the neighborhood model gains an advantage in segmentation because it is provided with more information about the SAR data, its stretch, and the roles of regions. This suggests that the role channel has positive effects on land/water segmentation, giving the neighborhood model its advantage, and, more broadly, that contextual information about a target superpixel and its neighbors has overall positive effects on the classification performance of the algorithm.
5. Conclusions
We have devised novel superpixel schemes which use the mixture-based superpixel framework for single-date SAR imagery, incorporating the unstructured/structured data of both target and neighbor superpixels into the input for the ML algorithms. Designing the unstructured and structured schemes to annotate and segment target superpixels, we have developed the MISP-SDT and MISP-XGB schemes, respectively, together with single/mask/neighborhood models for the unstructured scheme and single/neighborhood models for the structured scheme. Accordingly, our schemes enable us to take advantage of contextual information about SAR imagery to segment superpixels.
Our schemes were applied to Sentinel-1 SAR data to examine segmentation performance. The results of the numerical experiments demonstrated that the cross-polarized mode produced better results than the co-polarized mode, which is consistent with previous studies. Under our MISP-SDT/XGB schemes, where we did not rely on Sentinel-2 optical imagery, the neighborhood models outperformed the FCNN model in IoU and mIoU.
As to models, the neighborhood model as a whole gave better performances than the single; this pattern of numerical results emerged regardless of whether the scheme was unstructured or structured.
We have used attention maps and feature importance scores, demonstrating that neighbor regions were attended to or used by the ML algorithms in the neighborhood models. Our findings suggest that, under the unstructured/structured schemes, contextual information provided by neighbor superpixels helps the ML algorithms improve land/water classification performance. As future work, the interpretability of ML could be enhanced by further developing such schemes, and it will be worthwhile to consider how they could be applied to different research fields and topics.
Figure 1.
The flow diagram of the data processing.
Figure 2.
An example of the input for the unstructured scheme, where the 1st, 2nd, and 3rd columns represent the SAR channel, stretched SAR channel, and role channel, respectively. The inputs for the single model (the 1st row), mask model (2nd), and neighborhood model (3rd) are shown together.
Figure 3.
Architecture of the MISP-SDT models and an example of input.
Figure 4.
The relations between the top x percent of all features and the validation scores for the VV mode (red) and the VH mode (blue). The results from the single models (upper subfigures) and the neighborhood models (lower) are presented. The validation scores are mIoU (left subfigures) and IoU (right).
Figure 5.
The segmentation results for Somalia_94102 (top) and USA_504150 (bottom): the SAR imagery (1st subfigure), the segmentation results from the single model (2nd) and the neighborhood model (3rd), and the ground truth (4th).
Figure 6.
Several examples of the sets of stretched SAR channel (1st column), role channel (2nd), attention map for the single model (3rd), attention map for the mask model (4th), attention map for the neighborhood model (5th), and class label map (the ground truth) (6th). In the class label map, land, water, and unknown regions are filled with green, blue, and black, respectively.
Figure 7.
A few examples of part of the SAR imagery (left), its target/neighbor superpixels (center), and its Target/Near Neighbor/Far Neighbor regions (right). The Target, Near Neighbor, and Far Neighbor regions are filled with white, indigo, and gray, respectively.
Figure 8.
The averages of attention rollout scores for the roles (Pad in the 1st subfigure, Far Neighbor in the 2nd, Near Neighbor in the 3rd, Target in the 4th) of a superpixel, for the single, mask, and neighborhood models.
Table 1.
Breakdown of the Sen1Floods11 dataset.
Type | Sample size
Train (Hand-labeled) | 252
Validation | 89
Test | 90
Bolivia | 15
Table 2.
Values/Choices of parameters under the unstructured/structured schemes.
Scheme | Parameter | Value/Choice
Unstructured | Epochs | 80
Unstructured | Learning rate |
Unstructured | Batch size | 512
Unstructured | Optimizer | AdamW
Unstructured | Loss function | Cross entropy
Unstructured | Weight decay | 0.05
Unstructured | Dropout | 0.0
Structured | Boosting count | 1450
Structured | Learning rate | 0.0035
Structured | Max depth of tree | 10
Structured | Early stopping round | 250
Structured | Objective function | Binary logistic
Structured | L1 regularization | 0.6
Structured | L2 regularization | 1.2
Structured | Tree method | Approx
Structured | Column subsample ratio | 0.55
Table 3.
Numerical results for the hand-labeled training data.
Pol | Scheme | Model | Accuracy | Recall | Precision | F-1 | IoU | mIoU
VV | Unstr | Single | 0.916 | 0.459 | 0.479 | 0.376 | 0.473 | 0.292
VV | Unstr | Mask | 0.888 | 0.578 | 0.362 | 0.364 | 0.438 | 0.274
VV | Unstr | Neighb | 0.926 | 0.425 | 0.541 | 0.403 | 0.515 | 0.309
VV | Str | Single | 0.908 | 0.533 | 0.444 | 0.391 | 0.462 | 0.300
VV | Str | Neighb | 0.923 | 0.514 | 0.479 | 0.411 | 0.513 | 0.319
VH | Unstr | Single | 0.926 | 0.542 | 0.468 | 0.431 | 0.565 | 0.338
VH | Unstr | Mask | 0.921 | 0.574 | 0.451 | 0.425 | 0.536 | 0.325
VH | Unstr | Neighb | 0.936 | 0.543 | 0.497 | 0.456 | 0.601 | 0.357
VH | Str | Single | 0.924 | 0.580 | 0.464 | 0.439 | 0.569 | 0.345
VH | Str | Neighb | 0.935 | 0.570 | 0.509 | 0.466 | 0.608 | 0.368
Table 4.
Comparison to FCNN and AlbuNet-34 in terms of AW.
Metric | FCNN (HL) | FCNN (S1W) | AlbuNet-34 | MISP-SDT | MISP-XGB
IoU | n/a | n/a | 0.497 | 0.601 | 0.608
mIoU | 0.313 | 0.309 | 0.347 | 0.357 | 0.368