1. Introduction
Global Land Cover (GLC) plays a critical role in global resource monitoring and sustainable development research, as it provides vital insights into ecological diversity, carbon cycling, and human-environment relationships [1,2,3]. In recent years, with the rapid development of remote sensing technology, land cover classification based on remote sensing images has become the mainstream method for modern land cover mapping.
Unlike satellite images, which are usually obtained in near-real-time, GLC products are typically associated with substantial lag times between image processing and data release [4]. The shortest production cycle of GLC products is usually one year, and seasonal or monthly datasets are relatively rare [5]. In addition, the higher the product resolution, the longer the production cycle. Many GLC products are updated annually: the Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type product (MCD12Q1), with a resolution of 500 meters, has been provided by the National Aeronautics and Space Administration (NASA) from 2001 to the present [6]; the Climate Change Initiative (CCI) program of the European Space Agency (ESA) provides a 300-meter dataset (1992-2018), and the Copernicus Global Land Service (CGLS) [7] provides a 100-meter land cover dataset; Chen et al. [8] produced GlobeLand30, with a resolution of 30 meters, from Landsat and China's HJ-1 satellite images for 2000 and 2010, and its latest version began updating in 2017 and was officially released in 2020. GLC products with yearly or longer update intervals limit the ability to monitor land cover dynamics. Moreover, larger scales, higher resolutions, and faster update frequencies have exponentially increased data volumes, posing new challenges for storage and computing power. At the same time, the land cover mapping process is complex, and each stage depends on different models. Improving the computational efficiency of each stage under big data, systematizing and standardizing the mapping process, and achieving automated large-scale land cover mapping have therefore become key issues to be addressed.
Sample collection is considered the most time-consuming and labor-intensive step in producing GLC [9]. Existing methods for extracting classification samples still rely primarily on field investigations and visual interpretation of imagery, which not only demand extensive manual labor but also significantly restrict the scale of the study area. To address this problem, some scholars have studied the automatic extraction of classification samples by combining prior products and auxiliary data [10,11,12,13,14]. This approach uses classification rules extracted from prior products to quickly obtain a large number of spatially well-distributed, high-quality samples, thereby improving the efficiency of sample extraction and increasing the update frequency of land cover products. Zhang and Liu [15] used a total of 27,858,258 training samples to produce a GLC product, several hundred times the number of training samples traditionally selected manually, occupying a large amount of storage and computing space. ESRI [16] even used a super-large Sentinel-2 dataset of more than 5 billion samples to produce its 2020 land cover product at 10-meter resolution. When matching sample points with remote sensing images, extracting the spectral features and indices of the images consumes substantial resources, and introducing auxiliary data such as elevation and climate [17] further exacerbates the computational pressure of the matching step.
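To illustrate the overlay idea behind automatic sample extraction, the following minimal sketch keeps only pixels on which two co-registered prior products agree and draws a bounded random subset of them as candidate samples. The file names, the agreement rule, and the assumption that class codes have already been harmonized to first-level classes are illustrative, not the exact HALF implementation.

```python
import numpy as np
import rasterio
from rasterio.transform import xy

# Minimal sketch: treat pixels on which two co-registered prior GLC products
# agree (after harmonizing class codes) as homogeneous, then sample a bounded
# random subset of them as candidate training points. File names are
# illustrative placeholders.
with rasterio.open("from_glc_tile.tif") as a, rasterio.open("glc_fcs_tile.tif") as b:
    glc_a = a.read(1)          # assumed already mapped to first-level classes
    glc_b = b.read(1)
    transform = a.transform

rows, cols = np.nonzero(glc_a == glc_b)   # homogeneous pixels across products

rng = np.random.default_rng(42)
keep = rng.choice(len(rows), size=min(10_000, len(rows)), replace=False)

samples = [
    (*xy(transform, r, c), int(glc_a[r, c]))   # (lon, lat, class label)
    for r, c in zip(rows[keep], cols[keep])
]
```

Capping the subset size keeps the sample set balanced and small enough to avoid the storage and matching pressure described above.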
Remote sensing image mosaicking stitches multiple remote sensing images into a single, geometrically aligned composite scene with a wide field of view. Large-scale image mosaicking is computationally intensive [18]: creating a national- or global-level product by mosaicking tens of thousands of image scenes covering tens of terabytes of data can take days or even weeks [19]. Using high-performance computing methods to process tasks in parallel can improve mosaicking efficiency, and several parallel mosaicking methods have been studied. Zhang et al. [20] combined aerial digital photogrammetry principles with parallel computing techniques to propose a parallel mosaicking method for massive aerial digital images. Chen et al. [21] adapted a video image mosaicking algorithm to remote sensing images, overcoming the limitation that multiple images had to be mosaicked one pair at a time. Ma et al. [22] improved parallel mosaicking efficiency by creating a dynamic task tree to schedule the processing order of remote sensing images. Remote sensing data, intermediate data, and result data undergo frequent I/O operations during computation, which can lead to I/O bottlenecks. Jing et al. [23] customized RDD operators in Spark to accelerate parallel mosaicking through in-memory computation, and Ma et al. [19] introduced Alluxio, a memory-based file management system, combining it with compute-intensive image mosaicking to effectively alleviate the I/O bottleneck. Mosaicking classification results is similar to mosaicking remote sensing images, but waiting until all data within the study area have been collected before mapping may reduce the update efficiency of the dataset. Therefore, when mosaicking classification results, it is necessary to account for acquisition times that differ between regions.
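As a sketch of how classification results can be merged scene by scene rather than after all data arrive, the snippet below updates only the window of a pre-allocated grid mosaic that a newly classified scene overlaps. The paths, the nodata convention, and the assumption that the scene has already been resampled and aligned to the grid are illustrative, not HALF's exact code.

```python
import rasterio
from rasterio.windows import from_bounds

def update_grid_mosaic(grid_path: str, scene_path: str, nodata: int = 255) -> None:
    """Overwrite only the pixels of a grid mosaic covered by one classified scene.

    Assumes the scene has already been resampled and aligned to the grid's
    resolution and projection, as in the 10°x10° grid scheme described above.
    """
    with rasterio.open(scene_path) as scene, rasterio.open(grid_path, "r+") as grid:
        # Window of the grid raster covered by this scene's bounds.
        win = from_bounds(*scene.bounds, transform=grid.transform)
        win = win.round_offsets().round_lengths()

        new = scene.read(1)
        old = grid.read(1, window=win)
        valid = new != nodata          # keep earlier results where the scene is empty
        old[valid] = new[valid]
        grid.write(old, 1, window=win)
```

Because each update touches only one window, scenes can be merged in whatever order they are acquired, which is the property the mosaicking of classification results requires.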
The earliest remote sensing models were limited by the performance of desktop software running on a single machine. With the development of high-performance computing (HPC), HPC-based methods have been considered effective solutions to such computational challenges, including MPI/OpenMP [24], Hadoop [25], Spark [26], and GPU-supported CUDA programming [27]. Many HPC frameworks targeting remote sensing and geographic information computing scenarios have also been proposed, such as SpatialHadoop [28], Hadoop GIS [29], and GeoFlink [30]. Although using these frameworks and migrating computation to the cloud has alleviated the pressure of local computing, sharing data and models remains difficult because users store and organize data in different formats [31]. Moreover, the models used in the various stages of GLC mapping are written in different programming languages and run in different environments, so researchers spend considerable time and effort setting up and maintaining environments. Nüst et al. [32] encapsulated runtime dependencies and analysis components in containers, improving the reusability, accessibility, and transparency of programs. Wang et al. [33] implemented container-based automatic encapsulation of spatial information processing operators across different environments. However, atomic-level geographic processing services have limited capacity and may not suffice for large-scale, complex processing tasks [34,35]. Thus, it is necessary to construct complex workflows that allocate resources efficiently and orchestrate the data and models of complex GLC mapping [36].
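To make the orchestration idea concrete, the sketch below chains four hypothetical containerized mapping stages in an Airflow DAG using the DockerOperator. The image names and task granularity are assumptions for illustration; as described later, the actual framework drives Airflow through CWL (CWL-Airflow) rather than hand-written DAGs.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

# Hypothetical container images for four mapping stages; the real HALF
# workflow is modeled in CWL and executed via CWL-Airflow.
with DAG(dag_id="glc_mapping", start_date=datetime(2023, 1, 1),
         schedule=None, catchup=False):
    samples = DockerOperator(task_id="sample_generation",
                             image="half/sample-generation:latest")
    matching = DockerOperator(task_id="sample_image_matching",
                              image="half/feature-matching:latest")
    classify = DockerOperator(task_id="train_and_predict",
                              image="half/classifier:latest")
    mosaic = DockerOperator(task_id="result_mosaic",
                            image="half/result-mosaic:latest")

    samples >> matching >> classify >> mosaic
```

Encapsulating each stage in its own image is what removes the environment heterogeneity: the workflow engine only needs Docker, not the Python, GDAL, or Spark stacks of the individual models.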
To address these challenges, this paper designs a high-performance automated large-scale land cover mapping framework (HALF) under a distributed architecture. The framework introduces high-performance computing into the various stages of mapping and develops parallel computing strategies for automated sample production and for classification result mosaicking and updating, based on the characteristics of the data, effectively improving computing efficiency in large-scale land cover scenarios. Furthermore, the models of each mapping stage are encapsulated with container technology and orchestrated with Airflow, resolving the heterogeneity of operating environments and facilitating data reuse. HALF is empirically tested in several 10°×10° regions worldwide, providing theoretical and technical support for automated GLC production.
4. Discussion
The HALF framework presented in this study offers an automated, high-performance solution for large-scale land cover mapping. By encapsulating the models of each process with container technology, HALF addresses model heterogeneity between processes, significantly reducing deployment workload without sacrificing operational performance. HALF integrates the various stages of land cover mapping by using the CWL-Airflow workflow to organize and arrange the models, increasing automation and flexibility. While the proposed method was experimentally verified for large-scale, efficient mapping, service publishing and sharing functions for workflows are still needed. Collaborative efforts between experts from various fields and countries are crucial for effective GLC mapping, and real-time sharing of models, processes, and data can have significant implications for land cover mapping.
HALF provides a method for extracting a large number of homogeneous samples from multiple prior products. This method facilitates the generation of many spatially well-distributed and balanced sample points, significantly reducing the human and material resources consumed in sample selection. However, the accuracy of the generated samples depends on the prior products. In this study, many samples were generated by automatically overlaying classification products, and the first-level classes of the FROM_GLC and GLC_FCS products were extracted. The chosen classification system was relatively coarse, and the samples produced by this method were constrained by the accuracy of the prior products, which in turn affected the training of the classification model. In future research, the quality of the produced samples can be controlled more rigorously by adopting a more refined classification system. In addition, techniques such as change detection and transfer learning can be used to create large numbers of uniformly distributed, high-precision samples.
HALF optimizes and accelerates sample feature matching and classification result mosaicking using high-performance computing. The specific effects of the methods and their performance under different data volumes are discussed in Sections 3.4 and 3.5. The experimental results show that the proposed matching method can quickly establish spatial relationships between points and images and read image attribute data in parallel, effectively improving the computational efficiency of sample-image matching and feature extraction. It is more than 10 times faster than conventional matching methods and performs well and stably under different data volumes, making it applicable to other remote sensing applications that require matching large-scale vector and raster data. The proposed classification result mosaicking method uses image slices as basic units, mapping large-scale classification results to products in real time and improving mapping efficiency. The total time required to resample and mosaic a single Landsat image into a 10°×10° grid averaged about 6.5 seconds, addressing the low computational efficiency of synthesizing multiple images. However, expanding the study area to the entire world may pose practical difficulties, as the computational burden grows exponentially with scope and resolution. Therefore, if a higher update frequency is desired, fusion with remote sensing data of other spatiotemporal resolutions can be used to generate more timely products.
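The matching step can be pictured as an index-then-route pattern: a spatial index over image footprints is built once, and each sample point is then routed only to the images whose extents actually contain it, avoiding the quadratic cost of testing every point-image pair. The sketch below illustrates this with an R-tree; the input layout and function name are assumptions for illustration, not HALF's internal API.

```python
from rtree import index

# Minimal sketch of the index-then-match pattern. `images` is an assumed
# list of (image_id, (minx, miny, maxx, maxy)) footprint bounds, and
# `points` an assumed list of (x, y) sample coordinates.
def match_points_to_images(points, images):
    idx = index.Index()
    ids = {}
    for i, (image_id, bounds) in enumerate(images):
        idx.insert(i, bounds)      # index each footprint's bounding box
        ids[i] = image_id

    # Each point is routed only to images whose footprints contain it;
    # the subsequent per-image attribute reads are what run in parallel.
    return {
        j: [ids[i] for i in idx.intersection((x, y, x, y))]
        for j, (x, y) in enumerate(points)
    }
```

The index itself is cheap to build (the constant 0.8 s row in Table 5); the speed-up reported above comes from distributing the per-image feature reads that the index makes independent of one another.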
5. Conclusions
The production of large-scale land cover products is complex and requires more automated tools and higher-performance computing methods to meet the increasing demands on satellite remote sensing data resolution and product update frequency. HALF aims to improve the efficiency of each stage of traditional land cover mapping under big data by utilizing high-performance computing. To address the heterogeneity of operating systems, runtime environments, and programming languages between stages, container technology is used to encapsulate models such as sample point generation, sample point-image matching, model training and prediction, and classification result mosaicking, supporting model reuse and data sharing and greatly reducing deployment workload. Additionally, a general workflow language is introduced to model the land cover mapping process, organize the data and models of each stage, and decouple the production models from the overall process, thereby enhancing the automation and flexibility of mapping. In the future, we will further explore how to obtain higher-quality samples from multiple land cover products, encapsulate them as online data services with network technology, and integrate them into a unified production portal to better support researchers in the field. By continuously improving HALF, we can enhance the efficiency of large-scale land cover mapping and better support global resource monitoring and sustainable development.
Figure 1. The quantity and distribution of remote sensing image data.
Figure 2. Framework of HALF.
Figure 3. Schematic diagram of homogeneous region extraction.
Figure 5. Schematic diagram of the classification result mosaic. The world is partitioned into a grid system in which each grid spans 10° × 10°, and each grid is further divided into second-level sub-images. The blue border indicates the extent of a Landsat image; when classification results fall within a grid, the pixels of the second-level sub-images within the spatial extent of the results are updated, and the final mosaic product is generated.
Figure 6. Large-scale image mosaicking method.
Figure 7. Conceptual design of the workflow, describing the mapping phase of HALF: yellow nodes represent external data, green nodes computation results, blue nodes models, and red nodes classification models.
Figure 8. Performance comparison between physical machine and container. (a) Time consumption, with the models on the horizontal axis and their running times on the vertical axis. (b) Deployment workload, with the models on the horizontal axis and an abstract estimate of deployment workload on the vertical axis.
Figure 9. Automatically generated samples. The title of each subfigure indicates the latitude and longitude of the lower-left corner of the region. The legend is shown on the right side of each subfigure; white indicates areas with no samples.
Figure 10. Time consumption statistics of the matching method.
Figure 11. Performance comparison of matching methods: the orange line indicates the time consumption of the HALF matching method, and the blue line indicates that of the GDAL-based matching method.
Figure 12. Time consumption of the mosaicking method.
Figure 13. Performance analysis of mosaicking methods.
Figure 14. Mosaic of large-scale classification results.
Figure 15. Update of regional classification results for the Korean Peninsula.
Figure 16. Update of regional classification results for Japan.
Table 1. Comparison of existing GLC classification systems.

| USGS (1972) | CORINE (1985) | FROM_GLC (2013) | GlobalLand30 (2014) | GLC_FCS (2021) |
|---|---|---|---|---|
| Forest | Forest and semi-natural areas | Forest | Forest | Forest |
| Agricultural | Agricultural areas | Crop | Cultivated land | Cropland |
| | | Shrub | Shrubland | Shrubland |
| Range | | Grass | Grassland | Grassland |
| Wetlands | Wetlands | Wetland | Wetland | Wetlands |
| Urban or built-up | Artificial surfaces | Impervious | Artificial surfaces | Impervious surfaces |
| Barren | | Bareland | Bareland and tundra | Bare areas |
| Water | Water bodies | Water | Water bodies | Water body |
| Perennial snow and ice | | Snow/Ice | Permanent snow/ice | Permanent ice and snow |
| Tundra | | Tundra | | |
| | | Cloud | | |
Table 2. Model metadata information.

| Column Name | Data Type | Length |
|---|---|---|
| ARTIFACT_ID | varchar | 50 |
| NAME | varchar | 50 |
| DESCRIPTION | varchar | 255 |
| USAGES | varchar | 100 |
| MAIN_CLASS | varchar | 100 |
| CREATE_DATE | datetime | / |
| VERSION_ID | int | / |
| KEYWORDS | varchar | 150 |
| INPUT | longtext | / |
| OUTPUT | longtext | / |
| PARAMETERS | longtext | / |
| MODEL_PATH | varchar | 255 |
| MODIFY_DATE | date | / |
| TEST_CASE | longtext | / |
Table 4. Descriptions of sample points.

| Feature | Data Source | Characteristic |
|---|---|---|
| Spatial characteristics | Landsat | Longitude and latitude |
| Temporal characteristics | Landsat | Image acquisition time |
| Spectral characteristics | Landsat | Band1, Band2, ..., Band7 |
| RS index | Landsat | NDVI, NDWI, EVI, NBR |
| Topographic features | DEM | DEM, slope, aspect |
| Land cover type | GLC | First-level type |
Table 5. Time cost (s) for each step of the matching method at different data sizes.

| Stage | Step | GDAL (100,000) | HALF (100,000) | GDAL (200,000) | HALF (200,000) | GDAL (300,000) | HALF (300,000) |
|---|---|---|---|---|---|---|---|
| 1 | Build spatial index | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 |
| 2 | Read data and create spatial relationships | 3.7 | 2.4 | 10.5 | 3.6 | 26.6 | 4.4 |
| 3 | Distribute tasks | / | 3.9 | / | 7.0 | / | 8.4 |
| 4 | Compute feature values | 179 | 23.0 | 247 | 28.2 | 283.6 | 30.7 |
| 5 | Total | 183.5 | 30.1 | 258.3 | 39.7 | 311 | 44.3 |