1. Introduction
Soil provides the essential elements needed by terrestrial organisms and acts as a critical link connecting the atmosphere, hydrosphere, biosphere, and lithosphere. It is considered the central component of the Earth's critical zone [1]. Soil moisture (SM) movement, in particular, serves as the primary driver of soil physical processes [2]. Although SM constitutes only a small fraction of the Earth's total water reserves, it plays a pivotal role in numerous hydrological, biological, and biogeochemical processes. SM regulates land-atmosphere interactions, influencing climate and weather patterns [3,4], and is a crucial factor in the water and energy exchange within the soil-plant-atmosphere continuum [5]. It also governs processes within ecosystems [6] and is a determining factor for runoff potential [7]. Consequently, accurate SM information is of great importance across a wide range of research fields.
To date, acquiring ground truth soil moisture data with both large-scale spatial coverage and temporal continuity remains a major challenge for many researchers. Most soil moisture products are derived solely through remote sensing retrievals. However, as a macroscopic observation technique, remote sensing inherently introduces errors relative to ground truth values, so these products must be validated for accuracy [8,9]. Objective and quantitative evaluations of product accuracy are essential for improving the production of remote sensing products and making them reliable sources of information. Over the years, numerous research initiatives, both domestic and international, have focused on obtaining ground truth data for validation. These include establishing long-term global soil moisture observation networks [6,10], using meteorological stations and other platforms [11], and conducting short-term sampling in smaller areas to obtain accurate measurements [12,13]. Owing to the convenience of site selection and to research objectives, most of these studies have been conducted in regions with relatively homogeneous soil moisture, such as farmlands and plains. For example, Park et al. [14] employed recurrent neural network long short-term memory (RNN-LSTM) models to predict soil moisture in soybean fields, achieving an R² value of 0.999. Spatial heterogeneity in soil moisture plays a critical role in processes such as evapotranspiration, runoff, precipitation, and atmospheric variability. However, this heterogeneity also complicates comparisons between in-situ observations and airborne or satellite-based remote sensing retrievals [15]. Consequently, retrieval in areas with high soil moisture heterogeneity is even more challenging. For instance, Singh and Gaurav [16] used a neural network to estimate surface soil moisture from satellite imagery of the large alluvial fans in the Himalayan foreland, achieving a correlation coefficient (R) of only 0.80. Another pressing issue is the current lack of high-resolution soil moisture data. Various studies have attempted to bridge the gap between coarse-resolution soil moisture products and finer-scale observations by integrating multiple high-resolution auxiliary variables [17,18]. Soil moisture products with a 1 km resolution are commonly generated using MODIS data [18,19,20]. Although Landsat observations provide surface parameters at a higher spatial resolution (30 m), they have rarely been explored for producing large-scale soil moisture products.
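The two accuracy figures quoted above (an R² of 0.999 and an R of 0.80) are not on the same scale, and a short calculation can make the difference concrete. The sketch below assumes R² is computed as 1 - SS_res/SS_tot against the measurements; how the cited studies computed their metrics is not stated here, and the soil moisture values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical volumetric soil moisture values (m3/m3), for illustration only.
observed = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
# Predictions that follow the trend exactly but are compressed toward the middle.
predicted = [0.16, 0.19, 0.22, 0.25, 0.28, 0.31]

r = pearson_r(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"R = {r:.3f}, R2 = {r2:.3f}")
```

In this toy case R is 1 because the predictions are a linear function of the observations, while R² is lower because it also penalizes the systematic compression. For a least-squares linear fit the two are related by R² = R², so an R of 0.80 would correspond to an R² of 0.64.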
Estimating soil moisture using traditional statistical or physically based models is extremely challenging due to the complex and interdependent relationships between observed reflectance, surface conditions, and various climatic variables [21]. In the big data era, once reliable predictors are selected and robust algorithmic models are applied, these data can be effectively leveraged to map target variables. Data-driven approaches, which account for a range of parameters (such as temperature, humidity, and other factors affecting soil moisture), have been shown to significantly improve soil moisture estimation accuracy [22]. Machine learning, a form of artificial intelligence, is often faster and more efficient than traditional methods. It can learn and estimate complex, non-linear mappings of distributed data without requiring prior knowledge. Moreover, it can integrate data from various poorly defined sources with unknown probability functions [23]. Machine learning techniques have advanced rapidly in predictive modeling, allowing the identification of intricate, often non-linear data structures and the generation of accurate predictive models. However, such empirical data-driven models require a representative set of reference samples as ground truth, and collecting ground-based reference data involves significant labor and resources. Additionally, errors may occur during the measurement process, leading to invalid data that can interfere with results. Redundant data can also reduce the efficiency of machine learning algorithms. Both issues affect the quality and quantity of usable reference samples [23]. These models typically rely on location- and sensor-specific data, as they are based on samples collected under particular operational conditions. This limits their applicability across different regions and remote sensing systems, as their effectiveness depends on the availability of reference samples [24,25].
Couckuyt et al. [26] categorized machine learning (ML) techniques into three major classes: (i) classical methods, (ii) ensemble methods, and (iii) neural networks and deep learning (DL) methods. In contrast to classical methods that rely on individual base learners, ensemble learning enhances predictive performance by combining multiple base learners. This approach offers greater accuracy and robustness than single models, demonstrating significant advantages in tasks such as classification, regression, and other complex analyses [27]. Neural networks and DL methods, on the other hand, are capable of capturing more intricate relationships between predictors and target variables. However, because these models are large and complex, they require extensive sample data for precise predictions, making them less suitable for smaller datasets. ML has already been widely applied in various domains, and one of the most valuable is soil moisture measurement [28]. It has been employed to develop new algorithms capable of accurately predicting soil moisture content, which in turn can be used to improve accuracy or serve other applications [29]. Umesh Acharya [30] evaluated the performance of several ML techniques in soil moisture retrieval across agricultural fields near the Red River Valley in North Dakota and Minnesota. The techniques assessed included Classification and Regression Trees (CART), Random Forest Regression (RFR), Boosted Regression Trees (BRT), Multiple Linear Regression (MLR), Support Vector Regression (SVR), and Artificial Neural Networks (ANN). Similarly, Yang Zhangjian [11] used meteorological station data and various soil moisture datasets to estimate 1 km daily surface soil moisture (SSM) across China using ML models, achieving promising results.
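The claim that combining base learners improves on single models can be illustrated with a minimal bagging sketch. It is not any of the cited implementations: the base learner is a one-split regression stump, the data are synthetic, and all names are hypothetical. For an averaged ensemble, the squared error is never larger than the average squared error of its members, which is what the final comparison shows.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for t in sorted(set(xs))[1:]:  # skipping the smallest x keeps both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # degenerate resample with a single distinct x
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(pred, truth)) / len(truth)

rng = random.Random(0)
n = 40
train_x = [i / (n - 1) for i in range(n)]
# Synthetic "soil moisture" with a linear trend plus noise (illustrative only).
train_y = [0.10 + 0.30 * x + rng.gauss(0, 0.03) for x in train_x]

# Bagging: train each base learner on a bootstrap resample of the training set.
learners = []
for _ in range(25):
    idx = [rng.randrange(n) for _ in range(n)]
    learners.append(fit_stump([train_x[i] for i in idx], [train_y[i] for i in idx]))

test_x = [i / 99 for i in range(100)]
truth = [0.10 + 0.30 * x for x in test_x]

base_preds = [[f(x) for x in test_x] for f in learners]
ensemble_pred = [sum(col) / len(col) for col in zip(*base_preds)]

mean_base_mse = sum(mse(p, truth) for p in base_preds) / len(base_preds)
ensemble_mse = mse(ensemble_pred, truth)
print(f"mean base MSE: {mean_base_mse:.5f}, ensemble MSE: {ensemble_mse:.5f}")
```

The size of the gain depends on how much the base learners disagree. Identical learners would give no improvement at all.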
To improve soil moisture estimation in regions with high heterogeneity while minimizing errors introduced by multi-source data, we used single-source satellite data to compute soil moisture-related indices. These indices were then used as features in machine learning models, yielding excellent predictive performance and producing spatially continuous, high-resolution (30 m) surface soil moisture maps. The specific objectives of this study are as follows: (1) to collect various soil moisture-related indices and compute them using Landsat 8 data; (2) to estimate soil moisture in the QLB-NET region by integrating in-situ soil moisture measurements, satellite remote sensing data, and elevation data through multiple machine learning models, compare the performance of four models, and map the soil moisture distribution in this area; (3) to explore the contribution of feature indices derived from auxiliary data to the prediction of soil moisture levels; and (4) to evaluate the overall and local heterogeneity of QLB-NET.
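As an illustration of objective (1), many reflectance-based indices take the form of a normalized difference between two bands. The sketch below computes two widely used examples, NDVI and NDMI, for a single pixel. These are stand-ins: the specific indices used in this study are not listed in this section, and the reflectance values are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return None if total == 0 else (a - b) / total

# Hypothetical Landsat 8 surface reflectance for one pixel (unitless, 0-1).
# In OLI band numbering, red is band 4, NIR is band 5, and SWIR1 is band 6.
pixel = {"red": 0.08, "nir": 0.32, "swir1": 0.20}

# NDVI = (NIR - Red) / (NIR + Red): vegetation greenness.
ndvi = normalized_difference(pixel["nir"], pixel["red"])
# NDMI = (NIR - SWIR1) / (NIR + SWIR1): sensitive to vegetation water content.
ndmi = normalized_difference(pixel["nir"], pixel["swir1"])

print(f"NDVI = {ndvi:.3f}, NDMI = {ndmi:.3f}")
```

Applied per pixel across a scene, each index yields one candidate feature layer for the models.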
4. Discussion
Given the high heterogeneity of soil moisture in the study area, we calculated over twenty indices that characterize soil moisture from Landsat reflectance data. Although these indices fit well in the settings where they were originally proposed, they represented soil moisture less accurately in our highly heterogeneous region. Therefore, we used these indices, along with elevation and its derivatives (slope and aspect), as input features for machine learning models. The accuracy of the soil moisture estimates improved significantly, and we also generated spatial soil moisture maps. SHAP was used to interpret the black-box machine learning models, but both the Boruta method for feature selection and SHAP for model interpretation are themselves essentially data-driven black-box methods. Hence, exploring the underlying mechanisms requires other perspectives.
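SHAP attributions are built on Shapley values, which a small exact computation can illustrate. This is a sketch under stated assumptions, not the SHAP library or the models used in this study: the response function `toy_model`, the scaled feature values, and the single baseline point are all hypothetical (SHAP typically averages over background data rather than using one baseline).

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start from the baseline point
        previous = model(current)
        for name in order:            # switch features to their actual values one by one
            current[name] = x[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_model(f):
    """Hypothetical soil moisture response with one interaction term."""
    return 0.20 + 0.10 * f["vswi"] + 0.05 * f["swci"] - 0.02 * f["elevation"] * f["vswi"]

# Hypothetical scaled feature values for one sample and a zero baseline.
sample = {"elevation": 1.0, "vswi": 2.0, "swci": 1.0}
baseline = {"elevation": 0.0, "vswi": 0.0, "swci": 0.0}

phi = shapley_values(toy_model, sample, baseline)
for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for the sample and the prediction at the baseline, and the interaction term is split equally between the two features involved. The values describe the model's behavior, not the physical mechanism, which is the limitation noted above.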
A noticeable issue in the soil moisture image (Figure 7) is the difficulty of representing narrow rivers and areas of oversaturated soil. In the image, rivers appear relatively dry, likely because riverbeds, floodplains, and surrounding areas are composed largely of gravel and sandy soils, which have a low field water-holding capacity. Another important limitation is that ensemble learning models cannot predict reliably beyond the spatial or temporal range covered by their training data. In areas without in-situ measurements, predictions are less accurate than in areas with samples. Moreover, due to the constraints of the loss function, the models tend to predict values that lean towards the mean, with larger values being underestimated and smaller values overestimated. Therefore, the actual range of values is generally wider than the predicted range.
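The narrowing of the predicted range can be reproduced with any predictor that outputs averages of training targets, as the leaves of tree-based learners do. The sketch below uses a k-nearest-neighbor average as a simple stand-in for that behavior. It is not one of the four models in this study, and the data are synthetic.

```python
def knn_average(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

# Synthetic training data: soil moisture rises linearly from 0.05 to 0.43.
train_x = list(range(20))
train_y = [0.05 + 0.02 * x for x in train_x]

# Predictions at the training locations are pulled in from both extremes.
in_sample = [knn_average(train_x, train_y, x) for x in train_x]

# Outside the sampled range the prediction saturates near the last observed values,
# whereas the underlying linear trend would give 0.05 + 0.02 * 30 = 0.65.
beyond = knn_average(train_x, train_y, 30)

print(f"observed range:  {min(train_y):.2f} to {max(train_y):.2f}")
print(f"predicted range: {min(in_sample):.2f} to {max(in_sample):.2f}")
print(f"prediction at x = 30: {beyond:.2f}")
```

Because every prediction is an average of observed targets, it cannot fall outside the observed range, and the extremes are smoothed towards their neighbors.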
It is well known that machine learning outcomes depend to some extent on sample size [51]. When the sample size is too small, complex data-driven models cannot be employed, nor can precise results be obtained. In this study, we chose Landsat satellite data, which has a revisit period of 16 days, to achieve high spatial resolution. After filtering out low-quality remote sensing images affected by cloud cover and other factors, our sample size became quite limited, and the temporal data lacked continuity. How to obtain accurate inversion results with a small sample size remains a question worth exploring. Although neural networks and deep learning models can construct large frameworks to handle complex problems and achieve higher accuracy, they depend on large sample sizes. Thus, we opted for ensemble learning models, which are better suited to small samples. In regions with high soil moisture heterogeneity, if the sample size is sufficient, partitioning the study area and training models within each subregion may yield more accurate results and show where the model performs well and where it does not. Future directions for improving prediction accuracy include integrating multi-source data [61,62,63], coupling data-driven models with other models [64,65], and utilizing more complex large-scale models [66,67].
Figure 1.
The distribution of in-situ soil moisture stations and the land use, soil texture and elevation of the study area.
Figure 2.
The scale of factors affecting the spatial variability of soil moisture.
Figure 3.
Scatter plots of selected features derived from elevation and of SM-related indices (not all features are shown).
Figure 4.
Box plot of the Z-score obtained by the Boruta feature selection algorithm.
Figure 5.
Scatter plots of predicted SM against measured SM for the CatBoost (a), ERT (b), RF (c), and XGBoost (d) models during training (left) and validation (right).
Figure 6.
Scatter plots of the four models on the test set.
Figure 7.
Spatial distribution of surface SM at 30 m resolution. Columns correspond to different models, and rows correspond to different dates.
Figure 8.
A view of contribution to prediction using: (a) a bar graph of the average absolute SHAP value; and (b) SHAP global explanation.
Figure 9.
SHAP heat map plot.
Figure 10.
SHAP dependence plots on (a) Elevation and VSWI, and (b) SWCI and VSWI for SM.
Figure 11.
CV values of the measured data and the predicted results of the model, where SM represents the CV of the measured sample and the other four represent the CVs of different ensemble models.
Figure 12.
Topographic complexity index distribution map of the study area.
Figure 13.
Distribution of measured and model-predicted values for each site within the 95% confidence interval range of the four ensemble models (N = 57). Three sites (41, 43, and 49) have missing data.
Figure 14.
Results of the test set of four ensemble learning models after removing high heterogeneity points.
Table 1.
Performance of Campbell-CS655 SM sensor with different indicators.
Performance Metrics | Soil Conductivity | Soil Moisture (Volumetric Water Content) | Soil Temperature |
Range | 0-8 dS/m | 0-100% | -50 to 70 ℃ |
Precision | ±(5% of the value + 0.05 dS/m) | ±3% | ±0.02 ℃ |
Accuracy | 0.5% | <0.05% | ±0.5 ℃ |