3.2.1. FAM (Feature Adaptive Mixer)
Convolutional Neural Networks (CNNs) act as high-pass filters that extract locally salient high-frequency information such as texture and detail [36]. The self-attention mechanism, by contrast, acts as a low-pass filter that extracts salient low-frequency information such as global structure and smooth regions [37]. Although traditional purely convolutional methods can effectively extract rich high-frequency features, they are unable to capture the spatial contextual information of the image. Conversely, methods based purely on self-attention tend to extract only the low-frequency information of the image and also suffer from high computational complexity and limited generalization. How to exploit the complementary advantages of these two computational paradigms has therefore become a bottleneck for further improving a model's feature extraction capability. The ideas of information distillation and frequency mixing in image super-resolution reconstruction offer useful insights: by mixing low-frequency and high-frequency features, a model's information flow and expressive ability can be effectively enhanced [38,39].
Figure 7.
Qualitative comparison on the ISPRS Vaihingen (top) and ISPRS Potsdam (bottom) test sets. Black boxes are added to highlight differences and facilitate model comparison.
Table 1.
Quantitative comparison with existing methods on the Vaihingen dataset. The best value in each column is highlighted in bold.
| Method | Backbone | Imp.surf | Building | Low.veg | Tree | Car | MeanF1(%) | OA(%) | mIoU(%) |
|---|---|---|---|---|---|---|---|---|---|
| FCN [40] | VGG-16 | 88.2 | 93.0 | 81.5 | 83.6 | 75.4 | 84.3 | 87.0 | 74.2 |
| DeepLabv3+ [41] | ResNet-50 | 88.0 | 94.2 | 81.3 | 87.8 | 78.1 | 85.9 | 88.9 | 76.3 |
| MAREsU-Net [42] | ResNet-18 | 92.0 | 95.0 | 83.7 | 89.3 | 78.3 | 87.7 | 89.7 | 78.7 |
| ABCNet [43] | ResNet-18 | 92.7 | 95.2 | 84.5 | 89.7 | 85.3 | 89.5 | 90.7 | 81.3 |
| BANet [44] | ResT-Lite | 92.2 | 95.2 | 83.8 | 89.9 | 86.8 | 89.6 | 90.5 | 81.4 |
| UNetFormer [45] | ResNet-18 | 92.7 | 95.3 | 84.9 | 90.6 | 88.5 | 90.4 | 91.0 | 82.7 |
| MANet [46] | ResNet-50 | 93.0 | 95.5 | 84.6 | 90.0 | 89.0 | 90.4 | 91.0 | 82.7 |
| Mask2Former [47] | Swin-B | 92.9 | 94.5 | 85.3 | 90.4 | 88.5 | 90.3 | 90.8 | 83.0 |
| DC-Swin [48] | Swin-S | 93.6 | 96.2 | 85.8 | 90.4 | 87.6 | 90.7 | 91.6 | 83.2 |
| FT-UNetFormer [45] | Swin-B | 93.5 | 96.0 | 85.6 | 90.8 | 90.4 | 91.3 | 91.6 | 84.1 |
| BAFormer-T | ResNet-18 | 93.7 | 95.7 | 85.4 | 90.2 | 91.0 | 91.2 | 91.6 | 84.2 |
| BAFormer | ResNet-18 | 93.7 | 96.0 | 85.7 | 90.9 | 91.2 | 91.5 | 91.8 | 84.5 |
Table 2.
Quantitative comparison with existing methods on the Potsdam dataset. The best value in each column is shown in bold.
| Method | Backbone | Imp.surf | Building | Low.veg | Tree | Car | MeanF1(%) | OA(%) | mIoU(%) |
|---|---|---|---|---|---|---|---|---|---|
| FCN [40] | VGG-16 | 88.5 | 89.9 | 78.3 | 85.4 | 88.8 | 86.2 | 86.6 | 78.5 |
| DeepLabv3+ [41] | ResNet-50 | 90.4 | 90.7 | 80.2 | 86.8 | 90.4 | 87.7 | 87.9 | 80.6 |
| MAREsU-Net [42] | ResNet-18 | 91.4 | 85.6 | 85.8 | 86.6 | 93.3 | 88.5 | 89.0 | 83.9 |
| BANet [44] | ResT-Lite | 93.3 | 95.7 | 87.4 | 89.1 | 96.0 | 92.3 | 91.0 | 85.3 |
| ABCNet [43] | ResNet-18 | 93.5 | 95.9 | 87.9 | 89.1 | 95.8 | 92.4 | 91.3 | 85.5 |
| SwinTF-FPN [49] | Swin-S | 93.3 | 96.8 | 87.8 | 88.8 | 95.0 | 92.3 | 91.1 | 85.9 |
| UNetFormer [45] | ResNet-18 | 93.6 | 96.8 | 87.7 | 88.9 | 95.8 | 92.6 | 91.3 | 86.0 |
| MANet [46] | ResNet-50 | 93.4 | 96.7 | 88.3 | 89.3 | 96.5 | 92.8 | 91.3 | 86.4 |
| Mask2Former [47] | Swin-B | 98.0 | 96.9 | 88.4 | 90.7 | 84.6 | 91.7 | 92.5 | 86.6 |
| FT-UNetFormer [45] | Swin-B | 93.5 | 97.2 | 88.4 | 89.6 | 96.6 | 93.2 | 91.6 | 87.0 |
| BAFormer-T | ResNet-18 | 93.5 | 96.8 | 88.2 | 89.2 | 96.4 | 92.8 | 91.3 | 86.4 |
| BAFormer | ResNet-18 | 93.7 | 97.3 | 88.5 | 89.7 | 96.8 | 93.2 | 92.2 | 87.3 |
To enhance the accuracy of boundary identification, we propose the FAM module, which captures more accurate boundary features by enhancing the information flow and expressiveness of the model. It not only addresses the limitation of single-scale features but also adopts a multi-branch structure to filter important features out of rich semantic information. Specifically, FAM consists of three main parts: a high-frequency branch, a low-frequency branch, and adaptive fusion, as shown in Figure 4. It separates the high-frequency and low-frequency components of an image so that local and global information can be captured by the respective strengths of convolution and self-attention, and it adaptively selects the fusion according to the contribution of each channel. Unlike traditional hybrid methods, we innovatively combine the static high-frequency affinity matrix extracted by convolution with the dynamic low-frequency affinity matrix obtained from self-attention, which enhances the ability of self-attention to capture both high-frequency and low-frequency information and improves feature generalization. In addition, given the characteristics of these two computational paradigms, we perform adaptive feature selection for multi-frequency mixing in the spatial domain, which dynamically adjusts the fusion according to the feature contributions.
The High-Frequency Branch is a simple and efficient module whose main function is to obtain local high-frequency features. Since high-frequency information can be captured by small convolutional kernels, we obtain local high-frequency features by cascading 1×1 and 3×3 standard convolutions [50]. To enhance the learning and generalization ability of self-attention, we introduce the resulting high-frequency affinity matrix into the low-frequency affinity matrix, compensating for the feature information that self-attention loses due to its linear modeling. Let the input feature map be $X \in \mathbb{R}^{H \times W \times C}$. After an identity mapping preserves the 2D feature map, standard convolutions with kernel sizes 1 and 3 keep the feature map size unchanged and generate the high-frequency features $F_h$ and the high-frequency affinity matrix $A_h$ as follows:

$$X_1 = \mathrm{Conv}_{1\times1}(X), \qquad F_h = X_3 = \mathrm{Conv}_{3\times3}(X_1),$$
$$A_h = X_1 \otimes X_3^{\top}, \qquad A_h \in \mathbb{R}^{N \times N},$$

where $\mathbb{R}$ denotes the real number set and ⊗ denotes matrix multiplication. $X_1$ is the feature map obtained by the 1×1 convolution and $X_3$ is the feature map obtained by the 3×3 convolution, both of size $H \times W \times C$. Finally, the high-frequency affinity matrix $A_h$ is obtained within each partition window by multiplying $X_1$ with the transpose of $X_3$, where $N$ is the size of the partition window.
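As a concrete illustration, the following is a minimal PyTorch sketch of the high-frequency branch described above; the class and parameter names (e.g., HighFreqBranch, window_size) are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn

class HighFreqBranch(nn.Module):
    """Cascaded 1x1 and 3x3 convolutions produce high-frequency features F_h,
    and a windowed affinity matrix A_h is formed from the two feature maps."""
    def __init__(self, channels: int, window_size: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.ws = window_size

    def forward(self, x):                                   # x: (B, C, H, W); H, W divisible by ws
        f1 = self.conv1(x)                                   # feature map from the 1x1 convolution
        f3 = self.conv3(f1)                                  # high-frequency features F_h
        B, C, H, W = x.shape

        def windows(t):                                      # partition into non-overlapping ws x ws windows
            t = t.view(B, C, H // self.ws, self.ws, W // self.ws, self.ws)
            t = t.permute(0, 2, 4, 3, 5, 1)                  # (B, nH, nW, ws, ws, C)
            return t.reshape(-1, self.ws * self.ws, C)       # (B*nWin, N, C) with N = ws*ws

        a_h = windows(f1) @ windows(f3).transpose(-2, -1)    # high-frequency affinity, (B*nWin, N, N)
        return f3, a_h
```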
The Low-Frequency Branch is the key component for capturing global contextual relationships and mainly uses a multi-head self-attention mechanism [51]. The input feature map $X$ is first expanded threefold along the channel dimension by a standard 1×1 convolution. After splitting into multiple heads, a window partition operation divides the 2D feature map into windows, and each window is flattened into a 1D sequence. Given windows of size $w$ and $h$ heads, the sequence is further divided into three feature vectors, Query ($Q$), Key ($K$), and Value ($V$), all of dimension $N \times \frac{C}{h}$ with $N = w \times w$. During the self-attention computation, a learnable relative position encoding (Position Embedding, PE) is introduced to indicate the positional information of the image sequence. The low-frequency affinity matrix $A_l$ generated by multi-head self-attention is combined with the high-frequency affinity matrix $A_h$ to obtain the mixed affinity matrix $A_{mix}$. Finally, after normalization by the sigmoid function, the normalized mixed affinity matrix is multiplied with $V$ to obtain the low-frequency feature map $F_l$ after mixed linear weighting:

$$A_l = Q \otimes K^{\top} + PE,$$
$$F_l = \mathrm{Sigmoid}(A_l \oplus A_h) \otimes V,$$

where ⊕ denotes element-wise addition, $\mathrm{Sigmoid}(\cdot)$ denotes the normalizing activation function, $A_l, A_h, PE \in \mathbb{R}^{N \times N}$, $N$ is the size of the partition window, and $PE$ is the learnable positional encoding of window size.
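A hedged sketch of the low-frequency branch is given below: windowed multi-head self-attention whose affinity matrix is mixed with the high-frequency affinity A_h and normalized by a sigmoid. The scaling by the head dimension, the bias-style position encoding, and all names are our assumptions; window reversal and head merging are omitted for brevity.

```python
import torch
import torch.nn as nn

class LowFreqBranch(nn.Module):
    """Windowed multi-head attention mixed with the high-frequency affinity A_h."""
    def __init__(self, channels: int, heads: int = 4, window_size: int = 8):
        super().__init__()
        self.h, self.ws = heads, window_size
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)   # 3x channel expansion
        n = window_size * window_size
        self.pe = nn.Parameter(torch.zeros(heads, n, n))              # learnable positional bias (assumed form)

    def forward(self, x, a_h):            # x: (B, C, H, W); a_h: (B*nWin, N, N) from the high-freq branch
        B, C, H, W = x.shape
        qkv = self.qkv(x)                                             # (B, 3C, H, W)
        # split into (Q, K, V), heads, and non-overlapping windows
        qkv = qkv.view(B, 3, self.h, C // self.h, H // self.ws, self.ws, W // self.ws, self.ws)
        qkv = qkv.permute(1, 0, 4, 6, 2, 5, 7, 3)                     # (3, B, nH, nW, h, ws, ws, C//h)
        qkv = qkv.reshape(3, -1, self.h, self.ws * self.ws, C // self.h)
        q, k, v = qkv[0], qkv[1], qkv[2]                              # each (B*nWin, h, N, C//h)
        a_l = q @ k.transpose(-2, -1) / (C // self.h) ** 0.5 + self.pe   # low-frequency affinity
        a_mix = a_l + a_h.unsqueeze(1)                                # element-wise add with A_h (broadcast over heads)
        return torch.sigmoid(a_mix) @ v                               # sigmoid-normalized mixed weighting of V
```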
High-Low Frequency Adaptive Fusion is a fusion mechanism built on spatial feature mapping. Inspired by the feature rescaling of SK-Net [52], different pooling operations are used to learn the contribution weights of the high-frequency and low-frequency features in the mixed channels, so that the network can select a more appropriate multi-scale feature representation. Specifically, the obtained high-frequency features $F_h$ and low-frequency features $F_l$ are added directly to obtain the mixed feature $F_{mix}$. Maximum pooling and average pooling are then applied to the mixed feature to obtain the high-frequency attention map and the low-frequency attention map, respectively. The two spectral maps are concatenated along the channel dimension and smoothed by a standard convolution of size $k \times k$. After activation along the fusion dimension, the high-frequency attention weight $W_h$ and the low-frequency attention weight $W_l$ are obtained and applied by element-wise multiplication to the high-frequency feature map $F_h$ and the low-frequency feature map $F_l$, respectively. Finally, the weighted feature maps are summed to obtain the output of the adaptive fusion, $F_{out}$:

$$F_{mix} = F_h \oplus F_l,$$
$$[W_h, W_l] = \sigma\!\left(f^{k\times k}\big([\mathrm{MaxPool}(F_{mix});\ \mathrm{AvgPool}(F_{mix})]\big)\right),$$
$$F_{out} = W_h \odot F_h \oplus W_l \odot F_l,$$

where $\mathrm{MaxPool}$ denotes maximum pooling, $\mathrm{AvgPool}$ denotes average pooling, $[\cdot\,;\cdot]$ denotes channel-level concatenation, $\sigma$ denotes the activation function, $\odot$ denotes element-wise multiplication, and $f^{k\times k}$ denotes convolution with a kernel size of $k \times k$.
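The adaptive fusion step can be sketched as follows. We interpret the pooling as channel-wise max/average pooling producing two spatial attention maps, and we assume a softmax over the two-map fusion dimension and a default kernel size of 7; these choices are illustrative, not confirmed by the paper.

```python
import torch
import torch.nn as nn

class FreqAdaptiveFusion(nn.Module):
    """Re-weights the high- and low-frequency branches with spatial attention maps."""
    def __init__(self, kernel_size: int = 7):                 # k x k smoothing kernel (assumed default)
        super().__init__()
        self.conv = nn.Conv2d(2, 2, kernel_size, padding=kernel_size // 2)

    def forward(self, f_h, f_l):                               # both (B, C, H, W)
        f_mix = f_h + f_l                                      # F_mix = F_h + F_l
        a_max = f_mix.max(dim=1, keepdim=True).values          # high-frequency attention map
        a_avg = f_mix.mean(dim=1, keepdim=True)                # low-frequency attention map
        w = self.conv(torch.cat([a_max, a_avg], dim=1))        # smooth the two concatenated maps
        w = torch.softmax(w, dim=1)                            # weights along the fusion dimension
        w_h, w_l = w[:, 0:1], w[:, 1:2]
        return w_h * f_h + w_l * f_l                           # adaptively fused output F_out
```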
Figure 8.
Qualitative comparisons with different methods on the LoveDA validation set.
Table 3.
Quantitative comparison between our method and existing methods on the LoveDA dataset. Per-class IoU (%) and mIoU (%) are reported. The best value in each column is displayed in bold.
| Method | Backbone | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| FCN [40] | VGG-16 | 42.6 | 49.5 | 48.1 | 73.1 | 11.8 | 43.5 | 58.3 | 46.7 |
| DeepLabv3+ [41] | ResNet-50 | 43.0 | 50.9 | 52.0 | 74.4 | 10.4 | 44.2 | 58.5 | 47.6 |
| SemanticFPN [53] | ResNet-50 | 42.9 | 51.5 | 53.4 | 74.7 | 11.2 | 44.6 | 58.7 | 48.1 |
| FactSeg [54] | ResNet-50 | 42.6 | 53.6 | 52.8 | 76.9 | 16.2 | 42.9 | 57.5 | 48.9 |
| TransUNet [55] | ViT-R50 | 43.3 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 | 48.9 |
| BANet [44] | ResT-Lite | 43.7 | 51.5 | 51.1 | 76.9 | 16.6 | 44.9 | 62.5 | 49.6 |
| SwinUperNet [56] | Swin-Tiny | 43.3 | 54.3 | 54.3 | 78.7 | 14.9 | 45.3 | 59.6 | 50.1 |
| DC-Swin [48] | Swin-Tiny | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 | 50.6 |
| MaskFormer [57] | Swin-Base | 52.5 | 60.4 | 56.0 | 65.9 | 27.7 | 38.8 | 54.3 | 50.8 |
| UNetFormer [45] | ResNet-18 | 44.7 | 58.8 | 54.9 | 79.6 | 20.1 | 46.0 | 62.5 | 52.4 |
| BAFormer-T | ResNet-18 | 45.9 | 57.9 | 58.2 | 79.0 | 19.0 | 47.3 | 61.4 | 52.7 |
| BAFormer | ResNet-18 | 44.9 | 60.6 | 58.6 | 80.4 | 21.3 | 47.5 | 61.5 | 53.5 |
3.2.2. RAF (Relationship Adaptive Fusion)
To obtain richer boundary features, fusing feature maps of different scales is considered an effective way to improve segmentation results [58]. The commonly used fusion methods are element-wise summation in the spatial domain and concatenation along the channel dimension. However, shallow and deep features in the network do not contribute equally to feature fusion. In general, shallow features have larger values and deep features have smaller values, which leads to differences in their spatial contributions. In addition, since shallow and deep features carry different semantic information, some semantic confusion also arises in the channel dimension. How to improve the effect of feature fusion has therefore become a new direction for optimizing network performance. Inspired by the perceptual fusion of shallow and deep branches in ISDNet [59], we propose RAF, a dynamic fusion strategy based on relationship perception. This module obtains more complete boundary information by improving the feature granularity; its detailed structure is shown in Figure 5.
Unlike other multi-scale static fusion methods, RAF can adaptively adjust the fusion of shallow and deep features according to the network's task requirements and data characteristics by explicitly modeling the spatial and channel dependencies between features. While preserving the deep semantic transformation, it makes full use of shallow features to achieve higher-quality feature reconstruction. Specifically, the method first models the spatial numerical differences between shallow and deep features through global average pooling to learn relational weight factors in the spatial dimension. Then, under this spatially modeled feature mapping, the relational weight factors in the channel dimension are learned by analyzing the channel relationship matrix obtained after feature transformation and compression. Finally, the features are fused by weighting under the dual relationship perception of space and channel. Let the shallow feature map be $F_s \in \mathbb{R}^{C \times H \times W}$ and the deep feature map be $F_d \in \mathbb{R}^{C \times \frac{H}{2} \times \frac{W}{2}}$. RAF first aligns the spatial sizes of the shallow and deep feature maps, explicitly extracts their feature information, and obtains two one-dimensional attention vectors $v_s$ and $v_d$ that contain the information of their respective channels:

$$v_s = \mathrm{GAP}(F_s), \qquad v_d = \mathrm{GAP}\big(\mathrm{Up}_{2}(F_d)\big),$$

where $\mathrm{GAP}$ denotes Global Average Pooling and $\mathrm{Up}_{2}$ denotes spatial upsampling by a factor of two. In the second step, the spatial and channel dependencies are modeled step by step. Global average pooling is applied to the two one-dimensional attention vectors $v_s$ and $v_d$ to obtain the spatial relationship weight factors $\alpha_s$ and $\alpha_d$:

$$\alpha_s = \mathrm{GAP}(v_s), \qquad \alpha_d = \mathrm{GAP}(v_d).$$

When modeling the channel dependencies, considering that there are differences between channel semantics, the two one-dimensional attention vectors $v_s$ and $v_d$ are scaled to length $r$ by a multilayer perceptron to obtain two contraction tensors $z_s$ and $z_d$. Matrix multiplication of these two contraction tensors yields the channel relevance matrix $R$, which is flattened and mapped by a multilayer perceptron into channel weight factors $\beta_s$ and $\beta_d$ containing only two values:

$$z_s = \mathrm{MLP}(v_s), \qquad z_d = \mathrm{MLP}(v_d),$$
$$R = z_s \otimes z_d^{\top}, \qquad [\beta_s, \beta_d] = \mathrm{MLP}\big(\mathrm{Flatten}(R)\big),$$

where $\mathrm{MLP}$ denotes the multilayer perceptron, $R$ denotes the channel relation matrix, and $\mathrm{Flatten}$ denotes the operation of straightening the relation matrix. The third step is dynamic weight fusion. The spatial weight factors and the channel weight factors are summed, and the weights $w_s$ and $w_d$ of the shallow and deep feature maps are obtained by applying $\mathrm{Softmax}$ along the first dimension. They are then used to weight the shallow feature map $F_s$ and the deep feature map $F_d$ by element-wise multiplication, and the weighted results are summed to obtain the final fused feature map $F_{fuse}$:

$$[w_s, w_d] = \mathrm{Softmax}\big([\alpha_s + \beta_s,\ \alpha_d + \beta_d]\big),$$
$$F_{fuse} = w_s \odot F_s \oplus w_d \odot \mathrm{Up}_{2}(F_d).$$
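A minimal sketch of RAF is given below under stated assumptions: the contraction length r, the MLP shapes, the bilinear two-times upsampling, and the requirement that shallow and deep maps share the same channel count are inferred from the prose rather than taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAF(nn.Module):
    """Relationship-aware fusion of a shallow map F_s and a deep map F_d."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp_s = nn.Linear(channels, r)       # contracts v_s to length r
        self.mlp_d = nn.Linear(channels, r)       # contracts v_d to length r
        self.mlp_out = nn.Linear(r * r, 2)        # flattened relation matrix -> (beta_s, beta_d)

    def forward(self, f_s, f_d):                  # f_s: (B, C, H, W); f_d: (B, C, H/2, W/2)
        f_d = F.interpolate(f_d, size=f_s.shape[-2:], mode="bilinear", align_corners=False)
        v_s = f_s.mean(dim=(2, 3))                # GAP -> channel descriptor (B, C)
        v_d = f_d.mean(dim=(2, 3))
        alpha = torch.stack([v_s.mean(dim=1), v_d.mean(dim=1)], dim=1)   # spatial factors (B, 2)
        z_s, z_d = self.mlp_s(v_s), self.mlp_d(v_d)                      # contraction tensors (B, r)
        rel = z_s.unsqueeze(2) @ z_d.unsqueeze(1)                        # channel relation matrix (B, r, r)
        beta = self.mlp_out(rel.flatten(1))                              # channel factors (B, 2)
        w = torch.softmax(alpha + beta, dim=1)                           # fused weights w_s, w_d
        w_s, w_d = w[:, 0].view(-1, 1, 1, 1), w[:, 1].view(-1, 1, 1, 1)
        return w_s * f_s + w_d * f_d              # relationship-aware weighted fusion
```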
3.2.3. DWLK-MLP (Depth-Wise Large Kernel Multi-Layer Perceptron)
Enlarging the convolutional receptive field is an effective means of improving semantic segmentation [60]. Recent studies have shown that introducing DW (depth-wise) convolution into MLP layers can effectively integrate the properties of self-attention and convolution, thus enhancing the generalization ability of the model [61]. Compared with the ordinary MLP [51], DW-MLP [62] with a residual structure introduces a 3×3 DW convolution into the hidden layer. This approach effectively aggregates local information, mitigates the side effects of the self-attention paradigm, and improves the generalization ability of the model. However, owing to the large number of channels in the hidden layer, a single-scale convolution kernel cannot effectively transform channel information with rich scale characteristics. To address this, the multi-scale feedforward network MS-MLP [63] was proposed; it uses DW convolutions with kernel sizes of 1, 3, 5, and 7 to capture multi-scale features, which enhances model performance to some extent. However, relying on the MLP alone to further transform multi-scale features offers limited gains in generalization, since the MLP also undertakes the important task of projecting feature maps for higher-level combination and abstraction.
To further improve the completeness of boundary features, we propose the simple and effective DWLK-MLP module, shown in Figure 6. This module enlarges the convolutional receptive field with depth-wise large-kernel convolutions, so that more complete boundaries can be extracted with almost no additional computational overhead. Unlike other methods, DWLK-MLP introduces the idea of large-kernel convolution, which can take on more advanced abstract feature extraction by creating a large-kernel receptive field. Specifically, we introduce a 23×23 depth-wise large-kernel convolution in front of the activation function, and the result is obtained by adding the initial feature map to the feature map after the large-kernel convolution via a skip connection. To reduce the number of parameters and the computational complexity, we decompose the large kernel into a sequence of 5×5 and 7×7 depth-wise convolutions. This exploits the lightweight nature of the depth-wise separable computational paradigm and promotes the fusion of self-attention and convolution to improve network generalization. Our experiments show that introducing the depth-wise large-kernel convolution before the activation function improves the accuracy and robustness of recognition more than placing it after the activation function.
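The following sketch illustrates DWLK-MLP under stated assumptions: the hidden expansion ratio, the GELU activation, and a dilation of 3 on the 7×7 depth-wise convolution (which, stacked after the 5×5, yields a 23×23 effective receptive field) are our inferences, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DWLKMLP(nn.Module):
    """MLP with a decomposed 23x23 depth-wise large-kernel path before the activation."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        # Decomposed large kernel: 5x5 DW conv followed by a dilated 7x7 DW conv.
        self.dw5 = nn.Conv2d(hidden, hidden, 5, padding=2, groups=hidden)
        self.dw7 = nn.Conv2d(hidden, hidden, 7, padding=9, dilation=3, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W)
        h = self.fc1(x)
        h = h + self.dw7(self.dw5(h))        # skip connection around the large-kernel path
        h = self.act(h)                      # large kernel applied before the activation
        return self.fc2(h)
```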
Figure 9.
Qualitative comparisons with different methods on the Mapcup test set.
Figure 10.
Inference visualization was performed in a randomly selected region in the north. Red represents Cropland and black represents non-Cropland. (a) High-resolution remote sensing large image. (b) Visualization of model inference.
Table 4.
Comparison of different methods on the Mapcup dataset.
| Method | Backbone | Cultivated Land | Non-Cropland | Mean F1(%) | OA(%) | mIoU(%) |
|---|---|---|---|---|---|---|
| FCN [40] | VGG-16 | 81.6 | 86.8 | 84.2 | 84.7 | 72.5 |
| DeepLabv3+ [41] | ResNet-50 | 82.7 | 87.5 | 85.1 | 85.5 | 74.2 |
| A2FPN [64] | ResNet-18 | 83.2 | 87.8 | 85.5 | 85.9 | 74.8 |
| ABCNet [43] | ResNet-18 | 84.0 | 88.1 | 86.1 | 86.4 | 75.6 |
| MANet [46] | ResNet-50 | 86.0 | 89.3 | 87.7 | 87.7 | 78.1 |
| BANet [44] | ResT-Lite | 86.7 | 89.8 | 88.3 | 88.5 | 79.0 |
| DC-Swin [48] | Swin-S | 86.8 | 89.7 | 88.2 | 88.4 | 79.0 |
| UNetFormer [45] | ResNet-18 | 87.2 | 90.3 | 88.8 | 88.5 | 79.6 |
| FT-UNetFormer [45] | Swin-B | 88.7 | 91.0 | 89.8 | 90.0 | 81.6 |
| BAFormer-T | ResNet-18 | 88.2 | 90.8 | 89.5 | 89.6 | 81.0 |
| BAFormer | ResNet-18 | 89.8 | 91.7 | 90.7 | 90.8 | 83.1 |