This section examines the results of the systematic literature review. Using the 106 extracted articles, the results are presented and discussed according to the RQs.
3.1. RQ1. What Are the Emerging Patterns in Land Cover Mapping?
The annual distribution varies from 2017 to 2023.
Figure 2 shows the number of research articles published annually from 2017 to 2023. The year 2017 saw a modest output of only 3 articles. The number of articles then rose to 12 in 2020, 17 in 2021, 30 in 2022, and 40 in 2023. This observation aligns with the understanding that the adoption of deep learning semantic segmentation models on satellite imagery gained significant momentum from 2020 onward.
Figure 3 depicts the number of articles published per academic journal in this domain. The top 13 journals produced over 81% of the research studies on semantic segmentation in land cover mapping. MDPI Remote Sensing (30) has the highest number of published articles in this domain, followed by IEEE Journal of Selected Topics in Earth Science (15), IEEE Transactions on Geoscience and Remote Sensing (12), IEEE Geoscience and Remote Sensing Letters (5), ISPRS Journal of Photogrammetry and Remote Sensing (4), and so on, while 20 other journals with 1 article each are grouped in the “other” category.
In terms of the geographical distribution of the studies extracted from the Scopus database, 35 countries contributed to the study domain. Almost all continents have contributions except the African continent.
Figure 4 shows that China published 63 of the 106 articles, followed by the United States with 10 articles, India (6), Italy (5), South Korea (4), the United Kingdom (4), Canada, Finland, and Germany (3 each), and Austria, Australia, Brazil, France, Greece, Turkey, the Netherlands, and Japan (2 each), while the remaining countries have 1 article each.
The significant keyword occurrences are obtained from the titles and abstracts of the extracted articles. Figure 5 shows the relevant and leading keywords. A threshold of 5 was set as the minimum number of occurrences of a keyword; only 68 out of 897 keywords met the threshold. Bibliometric analysis reveals that keywords such as "high-resolution RS images", "Remote Sensing", "satellite imagery", and "very high resolution" are prominent and show strong associations with neural network-related terms including "Semantic Segmentation", "Deep Learning", "Machine Learning", and "Neural Network". Semantic segmentation is linked to "attention mechanisms" and "transformer" as architectural components of different models. These learning models are further linked to various application domains, evident in their connections to terms like "Land Cover Classification", "Image Classification", "Image Segmentation", "Land Cover", "Land Use", "Change Detection", and "Object Detection". In 2020, research revolved around network architectures, object detection, and image processing. In the later part of 2021, there was a notable shift toward image segmentation, image classification, and land cover segmentation. In 2022 and 2023, there was a pronounced shift toward semantic segmentation employing high-resolution satellite images for change detection, land use, and land cover classification and segmentation.
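The occurrence threshold described above can be illustrated with a minimal sketch. It is not the tool used in the review: the function name `filter_keywords`, the toy input, and the choice to count a keyword once per article (case-insensitively) are assumptions made only for illustration.

```python
from collections import Counter


def filter_keywords(keyword_lists, min_occurrences=5):
    """Count keywords across articles and keep those meeting the threshold.

    Assumption: a keyword is counted at most once per article and
    matching is case-insensitive.
    """
    counts = Counter()
    for keywords in keyword_lists:
        counts.update({kw.strip().lower() for kw in keywords})
    return {kw: n for kw, n in counts.items() if n >= min_occurrences}


# Hypothetical toy input, not the keywords extracted in the review.
articles = [["Deep Learning", "Semantic Segmentation"]] * 5 + [["LiDAR"]] * 2
kept = filter_keywords(articles, min_occurrences=5)
```

With a threshold of 5, the two keywords that appear in five toy articles are retained, while the keyword that appears in only two is dropped.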
3.2. RQ2. What Are Domain Studies of Semantic Segmentation Models in Land Cover Mapping?
In this section, the extracted papers are clustered by similar study domains.
Figure 6 shows the overall mind map of the domain studies. Land cover, urban, precision agriculture, environment, coastal areas, and forest are the most studied domains.
Figure 7 shows that, among the 106 articles, 36 studies cover both land cover (LC) classification and segmentation, 24 focus specifically on LC classification, 15 concentrate on urban applications, 9 address LC segmentation alone, 8 address environmental issues, 5 center on precision agriculture, 5 are oriented toward coastal applications, 3 address LC change detection, and 2 focus on forestry.
Of the 106 studies, 72 relate to land cover; their research activities encompass land cover classification (33.3%), land cover segmentation (12.5%), change detection (4.2%), and the combined application of land cover classification and segmentation (50%). These areas of study are extensively documented and represent the most widely researched applications within land cover studies. Land cover classification involves assigning an image to one of several classes of land use and land cover (LULC), while land cover segmentation entails assigning a semantic label to each pixel within an image [26]. Land cover refers to the various classes of biophysical earth cover, while land use describes how human activities modify land cover. Change detection, in turn, plays a crucial role in monitoring LULC changes by identifying changes over time, which can help predict future events or environmental impacts. Change detection methods employing DL have attained remarkable achievements [27] across various domains, including urban change detection [4], agriculture, forestry, wildfire management, and vegetation monitoring [28].
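The percentages quoted for the 72 land cover studies follow directly from the per-domain counts reported for Figure 7. The short sketch below reproduces that arithmetic; the helper name `shares` and the dictionary keys are illustrative labels, not terms from the review.

```python
def shares(counts):
    """Return each count as a percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts of land cover studies as reported for Figure 7 (36 + 24 + 9 + 3 = 72).
lc_counts = {
    "classification and segmentation": 36,
    "classification": 24,
    "segmentation": 9,
    "change detection": 3,
}
lc_shares = shares(lc_counts)
```

The resulting shares are 50.0, 33.3, 12.5, and 4.2 percent, matching the figures given in the text.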
Figure 6.
Land cover mapping domain studies.
Figure 7.
Number of publications per domain studies.
Among the 15 publications related to urban studies, 38% focus on segmentation applications, including urban scene segmentation [19,29,30], while 31% address urban change detection for mapping, planning, and growth [31,32]. For instance, change detection techniques provide insights into urban dynamics by identifying changes from remote sensing imagery [7], including changes in settlement areas. Additionally, these studies involve predicting urban trends and growth over time, managing land use [3], monitoring urban densification [33], and mapping built-up areas to assess human activities across large regions [34].
Publications in urban studies also cover land survey management (6%) [35] and urban classification and detection (25%) [36,37,38,39,40], particularly in building applications and urban land-use classification [41].
Of the 5 publications concerning precision agriculture, the research focuses on aspects such as crop mapping [42], identification [43,44], classification [45], and monitoring [46]. Examples include mapping large-scale rice farms [5], monitoring crops to analyze different growth stages [43], and classifying Sentinel data to create an oil palm land cover map [47].
Among the 8 environmental studies analyzed, the research spans various applications. These include 25% focusing on soil erosion applications [48,49], which involve rapid monitoring of ground cover to mitigate soil erosion risks. Additionally, 12.5% of the studies center on wildfire applications [50], encompassing burned area mapping, wildfire detection [51], and smoke monitoring [52], along with initiatives for preventing wildfires through sustainable land planning [8]. Another 12.5% of the studies involve haze classification [53], specifically cloud classification using Sentinel-2 imagery. Climate change research [54] accounts for another 12.5% of the studies, focusing on aspects such as the urban thermal environment. Furthermore, 12.5% of the studies are dedicated to vegetation classification [55], and an additional 25% address mining applications [56], including the detection of changes in mining areas [57].
Research in the forest domain encompasses forest classification, including the classification of landscapes affected by deforestation [58]. Moreover, change detection in vegetation and forest areas enables decision-makers, conservationists, and policymakers to make informed decisions through forest monitoring initiatives [6] and mapping strategies tailored to tropical forests [59].
In the coastal domain, 5 studies out of the 106 articles focus on wetland mapping, classification, and segmentation [10,60,61,62,63]. However, coastal remote sensing image segmentation remains a relatively underexplored research area, as noted by [61]. This is primarily attributed to the significant complexities associated with coastal land categories, including issues such as homogeneity, multiscale features, and class imbalance, as highlighted by [64].
3.3. RQ3. What Are the Data Used in Semantic Segmentation Models for Land Cover Mapping?
This section synthesizes the data most extensively employed for land cover mapping, in particular the study locations, data sources, and benchmark datasets.
Figure 8 illustrates the countries where the study areas were located and where in-depth research was conducted among the extracted articles. Among the 22 countries represented in 51 studies, 35.29% of the study areas are located in China, with 11.76% in the USA, 5.88% in France, and 3.92% in Spain, Italy, Brazil, South Korea, and Finland each. Other countries in the chart each account for 1.96% of the study areas.
Table 1 presents the identified data sources along with the number of articles and their corresponding references. The data sources identified in the literature include RS satellites, RS Unmanned Aerial Vehicles (UAVs) and Unmanned Aircraft Systems (UAS), mobile phones, Google Earth, Synthetic Aperture Radar (SAR), and LiDAR sources. Among these, Sentinel-2, Sentinel-1, and Landsat satellites are the most frequently utilized data sources. The primary remote sensing (RS) technologies are RS satellite imagery, Synthetic Aperture Radar (SAR), and Light Detection and Ranging (LiDAR).
RS Imagery: RS data are the most extensively utilized in land cover mapping. In RS, data are collected from satellite sources such as Sentinel-1, Landsat, Sentinel-2, WorldView-2, and QuickBird at regular time intervals over a period. The data products of these satellites include panchromatic (1 channel, 2D), multispectral, and hyperspectral images [1]. RS imagery (RSI) can also take the form of aerial images [2] captured using drones or UAVs. These data are characterized by their spatial resolution, spectral resolution, and image size.
Synthetic Aperture Radar (SAR): Among the radar systems used in land cover mapping, SAR stands out as a notable data source [83]. SAR utilizes radio detection technology and constitutes an essential tool in this field. SAR data carry distinct advantages, especially in scenarios where optical imagery faces limitations such as cloud cover or limited visibility. A key advantage is that SAR can penetrate cloud cover and image the Earth's surface even in unfavorable weather conditions, unlike optical signals, which are obstructed by clouds [84].
Various types of SAR data are harnessed for land cover mapping, such as polarimetric synthetic aperture radar (PolSAR) images [82], E-SAR, AIRSAR, Gaofen-3, and RADARSAT-2 datasets [85], GaoFen-2 data [86], GF-2 images [87], and Interferometric Synthetic Aperture Radar [6]. At present, semantic segmentation of PolSAR images holds significant utility in the interpretation of SAR imagery, particularly within agricultural contexts [88]. Similarly, the high-resolution GaoFen-3 SAR dataset is useful for the semantic segmentation of buildings [34,89,90]. The benchmark dataset Gaofen-3 (GF-3), comprised of single-polarization SAR images, holds significant importance [91]. This dataset is derived from China's pioneering civilian C-band polarimetric SAR satellite, designed for high-resolution RS. Notably, FUSAR-Maps are generated from extensive semantic segmentation efforts utilizing high-resolution GF-3 single-polarization SAR images [92], while the GID dataset is collected from the Gaofen-2 satellite.
Light Detection and Ranging (LiDAR): LiDAR plays a significant role in land cover mapping and climate change research [93]. LiDAR emits laser pulses and measures their return times to precisely gauge distances, creating highly accurate and detailed elevation models of the Earth's surface. It provides detailed information, including topographic features, terrain variations, and the vertical structure of vegetation, and stands as an indispensable data source for land cover mapping. Notable examples include multispectral LiDAR [55], an advanced RS technology that merges conventional LiDAR principles with the capacity to capture multiple spectral bands concurrently. Another is the Follo 2014 LiDAR dataset, captured in the Follo region during 2014. Additionally, the NIBIO AR5 (Norwegian Institute of Bioeconomy Research - Assessment Report 5) Land Resources dataset, developed by the Norwegian Institute of Bioeconomy Research, represents a comprehensive evaluation of land resources. This dataset encompasses a range of attributes including land cover, land use, and pertinent environmental factors [94].
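The ranging principle described above, in which distance is derived from the return time of a laser pulse, can be written as d = c·t/2, where t is the round-trip time and c is the speed of light. The following is a minimal sketch of that relation only; the function name and the example return time are hypothetical, and atmospheric and instrument corrections used in real LiDAR processing are ignored.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def lidar_range_m(round_trip_time_s):
    """Range to the target from the pulse round-trip time: d = c * t / 2."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the target and back, hence the division by two.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# Hypothetical return time of 6.671 microseconds, i.e. a target roughly 1 km away.
range_m = lidar_range_m(6.671e-6)
```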
The benchmark datasets used for evaluation in this domain, as identified in the review, are shown in Figure 9. ISPRS Vaihingen and Potsdam are the most widely used benchmark datasets, followed by GID, LandCover.ai, DeepGlobe, and WHDLD.
The ISPRS Vaihingen dataset comprises 33 aerial image patches in IRRGB format along with their associated digital surface model (DSM) data, each with a size of around 2500 × 2500 pixels at a 9 cm spatial resolution [95,96]. Similarly, the publicly accessible ISPRS Potsdam dataset covers the city of Potsdam, Germany. It is composed of 38 aerial IRRGB images measuring 6000 × 6000 pixels each at a spatial resolution of 5 cm [97].
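To relate the pixel dimensions and spatial resolutions quoted above to ground coverage, the extent of an image is simply its pixel count multiplied by the ground sampling distance. The sketch below applies this to the Potsdam and Vaihingen figures given in the text; the function name is illustrative, and the Vaihingen value uses the approximate 2500 × 2500 patch size.

```python
def ground_footprint_m(width_px, height_px, gsd_cm):
    """Return the (width, height) in metres covered by an image at a given GSD."""
    # Multiply in centimetres first, then convert to metres.
    return width_px * gsd_cm / 100.0, height_px * gsd_cm / 100.0


potsdam_tile = ground_footprint_m(6000, 6000, 5)     # Potsdam: 5 cm per pixel
vaihingen_patch = ground_footprint_m(2500, 2500, 9)  # Vaihingen: 9 cm per pixel
```

Under these figures, a Potsdam tile spans about 300 m on each side and a Vaihingen patch about 225 m.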
The Gaofen Image Dataset (GID) [98] contains 150 images of GaoFen-2 (GF-2) data [86,87]. The GF-2 images form an integral component of the benchmark GID dataset, offering valuable insights into land cover characteristics. Each image measures 6908 × 7300 pixels and comprises R, G, and B bands at a spatial resolution of 4 meters. The GID dataset consists of 5 land-use categories: farmland, meadow, forest, water, and built-up [99].
The LandCover.ai dataset [100] comprises images selected from aerial photographs covering 216.27 square kilometers of Poland, a country in central Europe. The dataset includes 41 RS images: 33 images with an approximate resolution of 25 cm, measuring around 9000 × 9000 pixels, and 8 images with a resolution of approximately 50 cm, spanning about 4200 × 4700 pixels [12,17,101]. The dataset was manually annotated into 4 classes: buildings, woodland, water, and background.
The DeepGlobe dataset [102] is another important dataset in land cover mapping. It stands out as the first publicly available collection of high-resolution satellite imagery primarily emphasizing urban and rural regions. The dataset comprises a total of 1146 satellite images, each with dimensions of 2448 × 2448 pixels [103], and is of great importance to the Land Cover Classification Challenge. Likewise, the Inria dataset [104] consists of aerial images covering 10 regions in the United States and Austria, collected at a 30 cm resolution with RGB bands [105]. Both the training and test data are organised into 5 cities. Every city includes 36 image tiles, each sized at 5000 × 5000 pixels, and these tiles are labeled with two semantic categories: building and non-building.
In addition, the Disaster Reduction and Emergency Management Building dataset exhibits a notable similarity to the Inria dataset. It has image tiles of 5000 × 5000 pixels with a spatial resolution of 30 cm, and all the tiles contain R, G, and B bands [99]. The building dataset from Wuhan University is an aerial dataset comprising 8189 image patches captured at a 30 cm resolution. These images are in RGB format, with each patch measuring 512 × 512 pixels [106]. The Aerial Imagery for Roof Segmentation dataset [107] is composed of aerial images covering the city of Christchurch in New Zealand, captured at a resolution of 7.5 cm with RGB bands [105] following the seismic event that impacted the city. Four images, each with dimensions of 5000 × 4000 pixels, were labeled with the following categories: buildings, cars, and vegetation [108]. Other benchmark datasets include the Massachusetts building and road datasets [109], the Dense Labeling RS dataset [110], the VEhicle Detection in Aerial Imagery (VEDAI) dataset, and the LoveDA dataset [111].
3.4. RQ4. What Are the Architecture and Performances of Semantic Segmentation Methodologies Used in Land Cover Mapping?
This section investigates the design and effectiveness of recent advancements in novel semantic segmentation methodologies applied to land cover mapping. In this paper, the methodologies employed in land cover mapping are classified based on similarities in their structural components. We have identified three primary architectural structures: encoder-decoder structures, transformer structures, and hybrid structures. Hybrid structures involve the integration of various architectural elements, including deep learning components, encoder-decoder models, transformers, module fusion techniques, and other parameters. Among 80 articles employing different model structures, 59% utilized hybrid structures, 36% utilized encoder-decoder structures, and 5% utilized transformer-based structures.
Encoder-decoder structures consist of two main parts: an encoder that processes the input data and extracts high-level features, and a decoder that generates the output (e.g., a segmentation map) based on the encoder's representations [16]. The authors of [112] proposed an encoder-decoder technique known as the Full Receptive Field network, which utilizes two types of attention mechanisms, with ResNet-101 serving as the backbone. Similarly, another DL segmentation framework, the Dual-Gate Fusion Network (DGFNET) [113], adopts an encoder-decoder architecture. Encoder-decoder architectures typically encounter difficulties with the semantic gap. To address this, the DGFNET framework comprises two modules, the Feature Enhancement Module and the Dual Gate Fusion Module, which mitigate the impact of semantic gaps in deep convolutional neural networks, leading to improved performance in land cover classification. The model was evaluated on both the LandCover dataset and the Potsdam dataset, achieving MIoU scores of 88.87% and 72.25%, respectively.
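MIoU and overall accuracy (OA) are the metrics reported for most models in this section. As a hedged illustration of how they are commonly computed from a confusion matrix, the sketch below uses plain Python; the function names and the toy labels are assumptions, and the reviewed papers may differ in details such as how absent classes or ignored pixels are handled.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


def overall_accuracy(cm):
    """Fraction of pixels whose predicted class equals the true class."""
    correct = sum(cm[c][c] for c in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy flattened label maps with three classes (hypothetical values).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
```

For this toy example the per-class IoU values are 1/3, 2/3, and 1/2, so MIoU is 0.5 while OA is 4/6. The gap shows why the two metrics are reported separately: OA rewards correctly labeled pixels overall, whereas MIoU penalizes both missed and spurious pixels in every class.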
The article [114] proposed a U-Net incorporating asymmetry and fusion coordination. It is an encoder-decoder architecture with an integrated coordinated attention mechanism and a non-symmetric convolution block refinement fusion block that captures long-term dependencies and intricate information from RS data. The method was evaluated on the DeepGlobe dataset and reportedly achieved the best MIoU, 85.54%, compared to other models such as UNet, MAResU-Net, PSPNet, and DeepLabV3+. However, the model has low network efficiency and is not recommended for mobile applications. Also, [115] proposed the Attention Dilation-LinkNet neural network, which has an encoder-decoder structure. It takes advantage of serial-parallel combined dilated convolution, 2 channel-wise attention mechanisms, and a pretrained encoder, making it useful for satellite image segmentation, particularly road extraction. The best-performing ensemble of the model achieved an IoU of 64.49% on the DeepGlobe road extraction dataset.
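Dilated convolution, mentioned above, enlarges the receptive field without adding weights by spacing the kernel taps apart. The one-dimensional sketch below illustrates only this general idea; it is not the serial-parallel arrangement of [115], and the function names and toy signal are assumptions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent of the input covered by one dilated kernel application."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D dilated cross-correlation."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are placed `dilation` samples apart.
        out.append(sum(w * signal[start + i * dilation] for i, w in enumerate(kernel)))
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
dense = dilated_conv1d(signal, [1, 1, 1], dilation=1)   # each output sees 3 samples
sparse = dilated_conv1d(signal, [1, 1, 1], dilation=2)  # each output spans 5 samples
```

With the same three weights, a dilation of 2 makes each output depend on a window of five input samples instead of three, which is the property that helps capture wider context such as long, thin roads.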
Table 2 lists some semantic segmentation models that use an encoder-decoder structure. It shows that these models generalize relatively well across different data; however, accuracy can be further enhanced through parameter optimization.
Transformer-based architectures are neural network structures originally designed for natural language processing (NLP), utilizing transformer modules as their fundamental building components. In the context of land cover mapping tasks, transformer-based architectures such as Swin-S-GF [120], BANet [121], DWin-HRFormer [29], spectral spatial transformer [122], Sgformer [18], and Parallel Swin Transformer [123] have been developed. Table 3 presents various transformer-based structures alongside their performance metrics and limitations. Researchers have noted that while these architectures achieve effective segmentation accuracy, with an average OA of approximately 89%, transformers can converge slowly and be computationally expensive, particularly in land cover mapping tasks. This limitation contributes to their relatively low adoption in land cover segmentation applications.
Table 3. Transformer-based semantic segmentation models for land cover segmentation.
Models | Data | Performance | Limitation
Swin-S-GF [120] | GID | OA = 0.89, MIoU = 80.14 | Computational complexity issue and slow convergence speed
BANet [121] | Vaihingen, Potsdam, UAVid dataset | MIoU = 81.35, MIoU = 86.25, MIoU = 64.6 | Combine convolution and Transformer as a hybrid structure to improve performance.
Spectral spatial transformer [122] | Indian dataset | OA = 0.94 | Computational complexity issue
Sgformer [18] | Landcover dataset | MIoU = 0.85 | Computational complexity issue and slow convergence speed
Parallel Swin Transformer [123] | Potsdam, GID, WHDLD | OA = 89.44, OA = 84.67, OA = 84.86 | Performance can improve.
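The computational cost attributed to transformers above stems largely from self-attention, which compares every token (for example, every image patch) with every other token. The sketch below is a deliberately simplified scaled dot-product self-attention in plain Python: it assumes identity projections (queries, keys, and values are the tokens themselves) and a single head, and it does not reproduce the windowed attention of Swin-based models or any architecture in Table 3.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (an assumption)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # One score per (query, key) pair: n * n scores for n tokens.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)])
    return outputs, weights


# Three hypothetical 2-dimensional patch embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = self_attention(tokens)
```

The attention weight matrix has one row and one column per token, so its size grows quadratically with the number of patches. For large, high-resolution RS images this is one reason such models are costly to train.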
A hybrid structure combines elements from different neural network architectures or techniques to create a unified model for semantic segmentation. Traditional convolutional neural network methods face limitations in accurately capturing boundary details and small ground objects, potentially leading to the loss of crucial information. While deep convolutional neural networks are applied to classify land use and land cover, the results often show suboptimal performance in land cover segmentation tasks [75]. However, hybrid approaches can tackle this by introducing encoder-decoder style semantic segmentation models, leveraging existing deep learning backbones [70], and exploring diverse data settings and parameters in experimentation [124]. Other methods of structural enhancement include architectural modifications through the integration of attention mechanisms, transformer architectures, module fusion, and multi-scale feature fusion [125,126]. An example is the SCOCNN framework [127], which addresses the limitations of CNNs through module integration: a module for semantic segmentation, a module for superpixel optimization, and a module for fusion. While the evaluated performance of the framework demonstrated improvement, boundary retrieval can be further enhanced by incorporating superior boundary adhesion and integrating it into the boundary optimization module.
Moreover, [128] proposed a multi-level context-guided classification method based on an object-based CNN. It involves high-level feature fusion and employs a Conditional Random Field for better classification performance. The model attained an overall accuracy comparable to DeepLabV3+ at various segmentation scale parameters on the Vaihingen dataset and a lower overall accuracy than DeepLabV3+ on the Potsdam dataset. Another approach identified is Generative Adversarial Network-based domain adaptation, such as the Full Space Domain Adaptation Network [106], as well as leveraging domain adaptation and transfer learning [129]. This has been shown to enhance accuracy in scenarios where source and target images originate from distinct domains, although domain adaptation segmentation using RS images remains largely underexplored [130]. The authors of [97] presented a CNN-based SegNet model that classifies terrain features using 3D geospatial data; the model performed better on building classification than on natural objects. The model was validated on the Vaihingen dataset and tested on the Potsdam dataset, achieving an IoU of 84.90%.
In addition, [131] proposed SBANet (Semantic Boundary Awareness Network) to extract sharp boundaries, with ResNet employed as the backbone. It was enhanced by introducing a boundary attention module and applying adaptive weights in multitask learning to incorporate both low- and high-level features, with the goals of improving land-cover boundary precision and expediting network convergence. The method was evaluated on the Potsdam and Vaihingen semantic labelling datasets, and the authors reported that SBANet performed best compared to models such as UNet, FCN, SegNet, PSPNet, and DeepLabV3+. The DenseNet-based model [132] modifies one of the DL backbones, DenseNet, by adding 2 novel fusions: unit fusion and cross-level fusion. Unit fusion is detail-oriented, while cross-level fusion integrates information from different levels. The model with both fusions performed best on the DeepGlobe dataset.
Furthermore, [98] proposed a bidirectional grid fusion network, a two-way fusion architecture for classifying land in very high-resolution RS data. It encourages bidirectional information flow with mutual benefits for feature propagation, and a grid fusion architecture is attached for further improvement. The best refined model was tested on the ISPRS and GID datasets, achieving MIoU performances of 68.88% and 64.01%, respectively.
Table 4 shows some identified hybrid semantic segmentation models and their performance metrics in land cover mapping. These models have demonstrated effective performance, with an average overall accuracy of 91.3% across the presented datasets.