4.4.1. Comparison with State-of-the-art Methods
In this study, we conducted comparative experiments on three datasets and obtained highly promising results. The experiments were designed to compare the accuracy of different approaches and identify the most effective ones. The results clearly show that our proposed method surpasses the other methods in accuracy, as demonstrated in Table 1 and Table 2.
The tables report the Intersection over Union (IoU) for each category and the mean IoU (mIoU) over all categories for each model. First, the classic semantic segmentation networks SegNet, FCN, and DeepLab, which were not designed for remote sensing images, yield unsatisfactory results. For low vegetation/grass and trees, two easily confused targets, SegNet and DeepLab exhibit the lowest IoU, both below 70. In contrast, every model tailored to remote sensing achieves an IoU above 70 in all categories, which we attribute to these methods accounting for the distinctive characteristics of remote sensing images. Notably, the proposed HRFNet attains an IoU above 80 in nearly all categories, surpassing all other methods. Specifically, it reaches mIoU values of 86.47 and 83.31 on the two datasets, nearly 10 percentage points higher than the classic FCN and DeepLab v3+ networks. Compared with DANet, LANet, and other models designed for remote sensing image segmentation, our method still improves mIoU by 1-2 percentage points, including a 1.5 percentage point gain for low vegetation/grass and trees, with further gains of varying degrees for the building and car categories. We attribute these improvements to the fusion of feature maps from different layers, which lets our method capture local detail features such as edges outside the discriminative region of the target.
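For reference, per-class IoU and mIoU can be computed from a confusion matrix accumulated over the test set. The following is a minimal sketch assuming integer-encoded label maps; the function and variable names are ours, not from any released code.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix
    from integer-encoded prediction and ground-truth label maps."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_per_class(cm):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for each class c."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)

# Example: 5 Vaihingen categories (impervious surfaces, building,
# low vegetation/grass, tree, car); preds and gts are assumed to be
# iterables of HxW integer label maps.
cm = sum(confusion_matrix(p, g, 5) for p, g in zip(preds, gts))
iou = iou_per_class(cm)
print("IoU per class:", iou * 100)
print("mIoU:", iou.mean() * 100)
```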
Quantitative analysis reflects the performance of the model. In addition, we perform a qualitative visual analysis of the segmentation results. The visualizations on the two datasets are shown in Figure 8 and Figure 9, where the first column is the input image, the second column is the ground truth, the middle four columns are the segmentation results of other methods, and the last column is our result. It can be seen that the proposed HRFNet and the models designed specifically for remote sensing images, UNetFormer and DANet, produce good segmentation results. In particular, HRFNet segments low vegetation/grass and cars well: for low vegetation/grass it yields clear edges with no obvious defects inside the target, and for cars it likewise produces clear edges and misses few instances.
Furthermore, we extracted feature maps at different layers and visualized them, as illustrated in Figure 10. For the low vegetation/grass category, the response predominantly covers the salient areas, albeit with less distinct edges. In some locations, parts of other objects are misclassified as low vegetation/grass, while in others the object is not fully covered. To address these issues, our method fuses feature maps from multiple layers, maximizing the exploration of discriminative regions while preserving sharper edges.
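As an illustration of how such intermediate feature maps can be pulled out for visualization, the sketch below registers forward hooks on a ResNet-style backbone; torchvision's resnet50 and its stage names serve as a stand-in for our backbone, not the actual model code.

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None).eval()
features = {}

def save_output(name):
    # Forward hook that stores each layer's output under its name.
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Hook the four residual stages whose outputs we visualize.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(save_output(name))

with torch.no_grad():
    backbone(torch.randn(1, 3, 512, 512))  # dummy tile-sized input

for name, fmap in features.items():
    # A channel-wise mean gives a single-channel map that can be
    # upsampled and overlaid on the input image for inspection.
    print(name, fmap.shape, fmap.mean(dim=1).shape)
```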
4.4.2. Ablation Experiments and Analysis
To evaluate our proposed IRM and rich-scale intra-layer feature enhancement methods while keeping computational cost and experimental effort manageable, we conducted ablation experiments on the Vaihingen dataset.
Ablation experiments on IRM. The Intra-Region Mining (IRM) module within our proposed HRFNet quantifies the information content at different locations in the image. Leveraging this information, the subsequent intra-layer rich-scale feature enhancement method extracts features from different locations within each layer to fuse multi-scale context. To demonstrate the effectiveness of IRM, we first designed ablation experiments on the number Rn of inter-layer feature fusion modules. When Rn=1, IRM is not used and the layer-2 feature maps of DeepLab v3+ with Res2Net as the backbone are applied uniformly to all subgraphs; in this case only Res2Net50 performs multi-scale feature extraction, which serves as the baseline. When Rn is 2 or 3, the feature maps of layers 2-3 and layers 2-4 are used, respectively, as sketched below.
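A compact way to express this ablation is a mapping from Rn to the backbone stages whose feature maps are fused. The configuration below is our own illustrative sketch with hypothetical stage names, not the released training code.

```python
# Hypothetical ablation configuration: which backbone stages feed
# the fusion path for each value of Rn (stage names are illustrative).
RN_TO_STAGES = {
    1: ["layer2"],                 # baseline: no IRM, a single stage
    2: ["layer2", "layer3"],
    3: ["layer2", "layer3", "layer4"],
}

def select_stage_features(features, rn):
    """Pick the per-stage feature maps used when the model runs with Rn=rn.
    `features` maps stage names to tensors, e.g. collected via forward hooks."""
    return [features[name] for name in RN_TO_STAGES[rn]]
```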
The experiments demonstrated that, within a certain range, increasing Rn and the number of fused inter-layer feature maps has a significant positive impact on the segmentation results. As shown in Table 3, simply replacing the ResNet backbone with Res2Net improved the segmentation results by nearly 1 percentage point, highlighting the effectiveness of our HRFNet design. By extracting features with fine-grained multiple receptive fields, IoU improved across all categories: by nearly 1 percentage point for impervious surfaces, trees, and low vegetation/grass, whose shapes and extents vary greatly, and by 0.6 for buildings. Notably, the extraction of dense small objects improved markedly, with a 2.3-point IoU gain under finer-grained receptive fields. Consequently, mIoU increased by more than 1 percentage point.
Furthermore, the experiments confirmed that our approach, inspired by Res2Net, effectively extracts multi-scale features from feature maps at a finer granularity. As shown in Table 3, with the fused feature layers fixed, the segmentation results improve to varying degrees as Rn increases. When two-layer feature maps are fused (Rn=2), performing differentiated feature extraction and fusion on two subgraphs with different information levels outperforms ordinary two-layer fusion over the entire image: improvements appear for buildings, cars, and low vegetation/grass, with the IoU for cars rising by nearly 0.6. Similarly, with three-layer feature maps and three subgraphs, IoU improves for impervious surfaces, buildings, and cars, with cars gaining nearly 0.7. With four-layer feature maps and four fine-grained subgraphs, mIoU improves by nearly 0.6 and nearly 0.4 over the entire image and two subgraphs, respectively; for cars, IoU rises by nearly 2 and 2.4 percentage points, and impervious surfaces gain nearly 0.7 and 0.2. Interestingly, when Rn is 4, simply fusing the four-layer feature maps of the entire image yields a similar mIoU to Rn=3 with differentiated feature extraction and fusion, which fully demonstrates the effectiveness of processing different regions of the image differently.
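For context, the finer-granularity multi-receptive-field idea borrowed from Res2Net splits the channels into groups and chains 3x3 convolutions so that each group sees a progressively larger receptive field. The block below is a minimal PyTorch sketch of that pattern, not our exact module.

```python
import torch
import torch.nn as nn

class Res2NetStyleBlock(nn.Module):
    """Minimal Res2Net-style split: channels are divided into `scales`
    groups; group i is convolved after adding the output of group i-1,
    so later groups aggregate progressively larger receptive fields."""

    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        # One 3x3 conv per group except the first, which passes through.
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)
        )

    def forward(self, x):
        groups = torch.chunk(x, self.scales, dim=1)
        out = [groups[0]]                       # identity branch
        prev = None
        for conv, g in zip(self.convs, groups[1:]):
            prev = conv(g if prev is None else g + prev)
            out.append(prev)
        return torch.cat(out, dim=1)

# Example: a 64-channel feature map processed at four scales.
y = Res2NetStyleBlock(64)(torch.randn(1, 64, 32, 32))
```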
Ablation experiments on IRFE. Our IRFE module consists of two main parts: intra-layer rich-scale feature extraction and inter- and intra-layer feature fusion. To evaluate these components, we conducted separate ablation studies on inter-layer feature fusion and intra-layer feature fusion, shown in Table 4 and Table 5, respectively. These experiments analyze the impact of each component on the segmentation results and demonstrate their effectiveness in enhancing the performance of our model.
First, in the evaluation of inter-layer feature fusion, we explored the impact of fusing different feature maps by varying Rn. The results are presented in Table 4. Feature maps from different layers contribute differently to the segmentation results: the 3rd- and 4th-layer feature maps contribute the most, while the 1st- and 2nd-layer feature maps also bring some improvement. Specifically, using the first two layers of feature maps, the mIoU on the Vaihingen dataset was 81.84. Replacing the first-layer feature maps with the third and fourth layers, respectively, improved the segmentation results by nearly 1 point: approximately 0.5 for impervious surfaces, trees, and low vegetation/grass, and around 2 percentage points for buildings and cars. Using the last two layers instead of the first two increased mIoU by 1.1; across the five categories, the IoU for impervious surfaces rose by approximately 0.7, buildings, trees, and low vegetation/grass by approximately 1 percentage point, and cars by nearly 2 percentage points. Finally, using all four layers of feature maps achieved the highest mIoU of 83.31, with a further increase in the IoU for impervious surfaces.
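A generic form of such inter-layer fusion resizes the selected stage outputs to a common resolution and merges them with a 1x1 convolution. The sketch below illustrates that pattern under our own naming; it is not the paper's exact fusion module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterLayerFusion(nn.Module):
    """Upsample selected backbone feature maps to the finest resolution,
    concatenate them, and mix channels with a 1x1 convolution."""

    def __init__(self, in_channels, out_channels):
        # in_channels: list of channel counts, one per fused layer.
        super().__init__()
        self.mix = nn.Conv2d(sum(in_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]  # finest (earliest) layer's spatial size
        up = [feats[0]] + [
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats[1:]
        ]
        return self.mix(torch.cat(up, dim=1))

# Example with four ResNet-style stage widths.
fuse = InterLayerFusion([256, 512, 1024, 2048], 256)
feats = [torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
print(fuse(feats).shape)  # torch.Size([1, 256, 64, 64])
```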
In addition, for the evaluation of intra-layer feature fusion, we took as the baseline the final feature map of each subgraph with the intra-layer feature maps directly concatenated. The results are shown in Table 5. Direct concatenation of subgraphs yielded the worst segmentation results, with discontinuous edges and some targets even failing to form independent regions. After simple edge smoothing, the segmentation results improved, and our proposed intra-layer feature fusion module improved them further, demonstrating that our approach fully preserves the detailed information in feature subgraphs during stitching. Specifically, simple edge smoothing improved the result for impervious surfaces by 0.25, increased the IoU for trees by nearly 0.7, and improved cars and low vegetation by 0.44 and 0.46, raising mIoU by 0.33. Surprisingly, the result for the building category actually decreased, possibly because edge smoothing slides parts of building targets into non-building categories, and vice versa. After incorporating our intra-layer feature fusion module, mIoU increased by an additional 0.47; compared with directly concatenating intra-layer feature maps, our method gains a total of 0.8 percentage points in mIoU. Notably, the increase in IoU for cars was the largest, a surprising 2.23, with further gains of 0.68 for low vegetation, 0.5 for trees, and 0.43 for impervious surfaces. As with the smoothing approach, the building category appears to have been affected by incorrect feature maps, resulting in a decrease in its results.
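To make the edge-smoothing baseline concrete, one simple variant blurs a narrow band around each stitching seam of the reassembled feature map. The sketch below is our own illustration of such a baseline, assuming vertical seams at known column positions; it is not the proposed intra-layer fusion module.

```python
import torch
import torch.nn.functional as F

def smooth_seams(fmap, seam_cols, band=4, k=5):
    """Blur a `band`-pixel strip around each vertical stitching seam of a
    (B, C, H, W) feature map with a k x k box filter, leaving the rest
    intact. A simple stand-in for the edge-smoothing baseline in Table 5."""
    c = fmap.shape[1]
    kernel = torch.full((c, 1, k, k), 1.0 / (k * k))
    blurred = F.conv2d(fmap, kernel, padding=k // 2, groups=c)  # depthwise box blur
    out = fmap.clone()
    for col in seam_cols:
        lo, hi = max(col - band, 0), min(col + band, fmap.shape[-1])
        out[..., lo:hi] = blurred[..., lo:hi]
    return out

# Example: two 32-column subgraph feature maps stitched side by side.
stitched = torch.cat([torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)], dim=-1)
smoothed = smooth_seams(stitched, seam_cols=[32])
```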