1. Introduction
Climate variables and their changes affect landslide activity [
1,
2,
3] leading to an increase in rainfall-induced landslides in many areas of the world. The threat of shallow landslides on people, environment, structures and infrastructures is becoming more severe. A reliable landslide susceptibility prediction has to account for these types of landslides [
4]. Landslide susceptibility assessment is an effective way to manage and evaluate landslides by predicting and demonstrating the probability and distribution of possible landslides in a certain area [
5,
6,
7]. The spatial distribution of landslides occurrence at the regional scale [
8,
9] provides a scientific basis for landslide disaster prevention and regional development planning [
10].
To predict “where” landslides are likely to occur, several methods have been proposed [
9]. Geomorphological mapping, analysis of landslide inventories, heuristic terrain and susceptibility zoning, physically-based numerical modeling, and statistically-based classification methods are just some of the proposed approaches [
11,
12,
13,
14,
15,
16,
17].
In addition to the traditional models, multifarious machine learning models have been developed and used to map landslide susceptibility [
18,
19,
20,
21].
In recent years, due to the rapid development of computer technology there has been a notable shift towards exploring advanced modeling architectures capable of delivering superior predictive performance. Techniques such as deep learning [
22,
23], ensemble modeling [
24], and hybrid approaches [
25,
26] have gained prominence, and showed the power of machine learning algorithms to predict landslide susceptibility [
26,
27,
28]. Among these techniques, the Random Forest (RF) machine learning algorithm is particularly promising. This model operates by constructing an ensemble of decision trees from randomly selected subsets of the input data, combining their predictions to derive a final output. RF demonstrated effectiveness in addressing both classification and regression tasks [
29,
30]. The strength of RF model lies in its ability to use the combined knowledge of individual trees while mitigating the risk of overfitting [
31].
In addition, the RF model is particularly advantageous in the landslide susceptibility assessment, due to its capacity to include different types of data [
32]. Both numerical and categorical variables, covering factors such as slope angle, lithology, and land use, can be seamlessly integrated without forcing strict distributional assumptions. This flexibility makes the RF model well-suited for analyzing the complex interaction of factors contributing to landslides occurrence.
The main factors that induce shallow landslides are divided into two categories: static and dynamic. Static factors, also known as conditioning factors, include topographic features, geological conditions, and land cover. Dynamic factors, also known as triggering factors, are the rainfall and soil moisture conditions.
The assessment of triggering factors is important for the reliability of susceptibility maps. Landslide susceptibility modeling has tipically focused on static information [
33,
34,
35], while the dynamic influence of rainfall and soil moisture have often been ignored [
36,
37]. Recent advancements in data-driven solutions allowed for the incorporation of spatiotemporal variations of these variables, leading to more accurate and comprehensive landslide hazard assessments [
38]. By incorporating explanatory variables that reflect spatiotemporal variations, such as rainfall and soil moisture parameters, the models can capture the complex interactions between environmental factors and landslide triggering mechanisms [
39,
40].
The evolution of data-driven landslide susceptibility models has been marked by an increasing sophistication in methodological approaches and a growing recognition of the importance of spatiotemporal dynamics [
28,
32,
41,
42,
43,
44,
45].
The space-time solution approaches explicitly account for the temporal dimension of landslide occurrences, incorporating dynamic variables like rainfall and antecedent hydrological conditions to capture the evolving nature of landslide hazards [
46]. These methodologies allow for the simultaneous assessment of landslide risk across both geographic space and time, enabling a more comprehensive understanding of the dynamic nature of land-slide occurrences [
47].
High-resolution soil moisture and precipitation data have become essential in understanding and predicting landslide events, especially in regions prone to rainfall-induced landslides [
48,
49,
50,
51].
Soil moisture and precipitation data can be derived from in situ measurements, but they have limitations. For example, the accuracy of both rainfall intensity-duration and cumulative rainfall-duration (E-D) models are influenced by various factors, such as the resolution of rainfall and geohazard data, the spatial relationship between rain gauges and landslide locations, and the criteria used to define threshold limits [
52,
53]. These approaches typically involve calculating rainfall parameters for each landslide event based on data from relevant rain gauges [
54]. The selection of appropriate rain gauges remains a challenge, with no universally accepted methodology [
55]. In addition, the spatial limitations of rain gauges hinder the accurate assessment of rainfall and soil moisture conditions in close proximity to landslide sites.
A promising solution to overcome these limitations is offered by satellite-derived data. By leveraging remote sensing technologies, researchers can obtain comprehensive, high-resolution rainfall and soil moisture data over large areas, providing a more accurate and spatially continuous representation of the environmental conditions that influence landslide occurrence [
56,
57,
58,
59]. The integration of these satellite-derived data into landslide prediction models has the potential to improve the accuracy and reliability of landslide forecasts, thereby enhancing the ability to mitigate the risks associated with these natural hazards.
In this study, the limitations of raingauge-based susceptibility models are addressed by integrating high-resolution satellite derived soil moisture and precipitation data with advanced machine learning techniques. By combining these powerful tools, a more sophisticated model that captures the intricate relationships between the various factors is developed, leading to a deeper understanding of the critical conditions under which landslides are triggered.
The document is organized as follows:
Section 2 provides a detailed description of the study area, defining the main geological and climatic features. An overview of the static and dynamic factors considered in the study is reported in
Section 3.
Section 4 describes the machine learning method used, and
Section 5 presents the results of the application of the model to a selected study area and the discussion related to the 2019 severe weather events which triggered a significant number of landslides in the area.
3. Materials
Data acquisition for landslide susceptibility analysis begins with the collection of raw data from various sources, including digital terrain models (DTMs), geological maps, Landsat 7 satellite images, and satellite-based rainfall and soil moisture products records. Ten static conditioning factors are selected and derived from the collected data: elevation, slope angle, aspect, plan curvature, profile curvature, geology, land cover, normalized difference vegetation index (NVDI), distance to road, and distance to river. These static factors have been commonly used by previous researchers in studying landslides see e.g. [
63,
64].
In addition to the static conditioning factors, daily rainfall and daily soil moisture data, along with two antecedent cumulative rainfall parameters (7 days and 15 days before the events), are selected to characterize soil moisture and rainfall conditions that trigger landslides during rainstorms.
3.1. Static Conditioning Factors
Landslide occurrence is directly impacted by the angle of the slope, a crucial factor contributing to shear stress and subsequent terrain instability. The digital terrain model (DTM) provides insights into regional topography and geomorphology, reflecting the influence of slope’s geographical features on landslide evolution. For this reason, five widely used slope characteristics – elevation, slope angle, aspect, plan curvature, and profile curvature – are calculated correspondingly. For instance,aspect indicates the direction a slope faces (represents the compass direction 0° to 360°), influencing sunlight exposure, moisture content, vegetation growth, and erosion rates. Slopes with certain aspects may retain more moisture, have less vegetation, or experience rapid snowmelt and freeze-thaw cycles.
Plan curvature and profile curvature represent the effects of topographic gradient on flow rate and water flow patterns, respectively. Curvature, generally, describes a surface’s deviation from flatness [
65].
QGIS software, utilizing three-dimensional analysis and data management tools, calculates and maps topographic and geomorphological factors to predict landslide susceptibility.
Figure 4 illustrates the distribution of elevation (
Figure 3 A) and slope angle (
Figure 3 B) across the study area.
Figure 3.
Elevation distribution in the study area (A) and slope distribution (B). Red dots show the occurred landslides in the study area.
Figure 3.
Elevation distribution in the study area (A) and slope distribution (B). Red dots show the occurred landslides in the study area.
To define the lithology distribution the classification proposed by Bucci et al. 2022 [
60] was adopted (
Figure 2). The hydrological environment was obtained by considering the distance to the nearest river for each point within the study area (
Figure 4).
Figure 4.
Spatial distribution of the distance to river parameter in the study area.
Figure 4.
Spatial distribution of the distance to river parameter in the study area.
Land cover, particularly vegetation, provides both hydrological and mechanical effects on slope stability. Land cover is an important conditioning factor for the rainfall-triggered landslides [
66,
67] due to its effect on the soil mechanical behavior and soil moisture (
Figure 5 A). In this study, the normalized difference vegetation index (NDVI), derived from remote-sensing data, is used to indicate the growth condition of vegetation. NDVI values close to or less than zero and decreasing negative values indicate non-vegetated features, such as rock and bare soil surfaces, water, and artificial structures. Higher positive values represent denser green vegetation, such as grass and forest (
Figure 5 B).
3.2. Dynamic Conditioning Factors
Rainfall and soil saturation conditions at the onset of a storm event are two of main triggering factors for landslides activation. Despite the study area is densely instrumented, localized events as well as local features may impact the rainfall and soil saturation observed fields. Moreover, over the study area, soil moisture monitoring networks are not present and modeled estimates could be impacted by the quality of the input rainfall. The use of hydrometeorological variables obtained through remotely sensed information could mitigate such limitations. To this end, in this study soil saturation and rainfall obtained through remotely sensed information has been used to train the proposed model. Rainfall data are obtained through the integration of multiple datasets in order to achieve a more reliable and more resolute field. More in details, the state-of-the-art Integrated Multi-satellitE Retrievals for GPM Late Run product [
68], the Climate Prediction Center Global Unified Gauge-Based Analysis of Daily Precipitation [
69] and the soil moisture-derived rainfall through SM2RAIN algorithm [
70] are integrated and then downscaled a 1 km of spatial resolution.
Table 1 summarizes the main features of the parent products.
For more details about the integration and the downscaling procedures, the readers are referred to Filippucci et al., 2022 [
71]. The soil moisture data have been obtained through the GLEAM model [
69].
GLEAM is a modelling framework dedicated to the estimation of evaporation over land through satellite data. The model estimates the different components of evapotranspiration, i.e. transpiration, bare-soil evaporation, interception loss, open-water evaporation and sublimation along with surface and root-zone soil saturation conditions. More in details, soil moisture is obtained by taking advantages of a multi-layer balance model that can assimilates satellite observations. The estimates used in the current study are those obtained by forcing the model with the same product used for retrieving the rainfall conditions, guaranteeing the same temporal and spatial resolutions.
The static and dynamic factors considered in this study for the landslide susceptibility assessment are listed in
Table 2 along with the data sources.
4. Method
To evaluate the landslide susceptibility of the area the Random Forest model (RF) has been used.
Evaluation of the RF model typically involves partitioning the dataset into training and testing subsets. The training data are employed for the construction of the decision tree ensemble, while the testing data are used to assess the model’s predictive performance.
4.1. Machine Learning
The RF model was implemented using the ‘randomForest’ package in the R 4.6.0 statistical software environment [
29]. The process began with input data from approximately 400 landslide locations and 400 non-landslide locations, utilizing features such as slope angle, lithology type, elevation, and rainfall. Multiple decision trees were built by the RF algorithm, each with different nodes based on these features. For example, slope angle might be used as the root in one tree, while lithology type could be the next node; in another tree, elevation might be the starting point followed by rainfall. This variety in tree construction allows different patterns in the data to be identified by the model.
Once the trees were built, new data points could be evaluated to assess the probability of a landslide occurring at those locations. The features of the new points—such as slope angle, lithology, elevation, and rainfall—were processed by each tree in the forest to provide a binary output, either “yes” (indicating a landslide is likely) or “no” (indicating it is not). The final output was determined by calculating the percentage of trees that predicted “yes.” For instance, if 70 out of 100 trees predicted “yes,” a probability of 0.7 (or 70%) for a landslide was assigned by the model. Selecting the optimal train/test split ratio is crucial for improving model performance. A well-defined split ratio reduces the risk of overfitting and enhances the model’s ability to perform effectively on unseen data. To ensure balanced data, 410 non-landslide samples, equal to the number of landslide samples, were utilized in the study. The proportion of training data varied from 0.5 to 0.9, which resulted in an increase in the number of landslide samples considered for model training from 200 to 360.
The classification accuracy of the RF model on the testing dataset, evaluated at different train/test split ratios, is shown in
Figure 6 A. It was observed that as the training subset ratio increased, the model’s classification accuracy on the testing dataset also improved, peaking at a 0.7 (70/30) split. This finding aligns with the recommendations of Gholamy et al. (2018) [
72], who advocate for a 70/30 split as optimal for model training and testing in similar contexts.
To minimize the out-of-bag error rate of bootstrap samples, the hyperparameters ‘ntree’ (number of trees) and ‘mtry’ (number of predictor variables considered at each split) were fine-tuned and optimized for each dataset using the model’s tuning function, as illustrated in
Figure 6 B and
Figure 6 C.
Four key performance indicators are used to evaluate the performance of RF in this study (
Figure 6 D): accuracy, recall, precision, and the receiver operating characteristic (ROC) curve. To understand these metrics, we first define the four prediction types in binary classification for landslide prediction:
True Positive (TP): Correctly predicts a landslide where one occurred.
False Positive (FP): Incorrectly predicts a landslide when none occurred.
True Negative (TN): Correctly predicts no landslide when none occurred.
False Negative (FN): Incorrectly predicts no landslide when one occurred.
The performance indicators are:
Accuracy: The ratio of correct predictions (TP + TN) to all predictions (TP + FP + TN + FN).
Recall: The model’s ability to identify actual landslides (TP) out of all true cases (TP + FN).
Precision: The reliability of predicting landslides, representing true positives (TP) among all predicted positives (TP + FP).
4.2. Relative Importance
In the second part of evaluating the performance of the RF, the statistical relationships between independent variables were assessed to determine potential correlation issues, prior to the conditioning factors importance analysis being investigated. Spearman’s correlation matrix [
73] was employed for this purpose (
Table 3), due to its suitability for both continuous and ordinal data [
74] being recognized. Multicollinearity, indicated by a high degree of relationship between variables, was identified by Spearman correlation values exceeding 0.7.
Table 3 reveals strong correlations among daily and antecedent rainfall factors (e.g., 7-day and 15-day antecedent rain). Such correlations have been shown to have minimal impact on the predictive performance of machine learning models [
75], they were still taken into consideration.
RF model assessed the relative importance of conditioning factors for landslide susceptibility prediction using mean decrease accuracy (MDA) and mean decrease GINI (MDG). MDA quantifies the reduction in model accuracy when a factor is excluded, while MDG measures a factor’s contribution to node purity. Higher values of MDA and MDG indicate greater importance for a given factor (
Figure 7).
All selected conditioning factors were observed to have positive values for landslide prediction model learning, although their degree of contribution differed. Specifically, among static topographic factors, slope angle was noted to be of the highest importance, followed by elevation, profile and plan curvature. Surface cover materials, indicated by land cover and NDVI, were found to play a more significant role than subsurface materials when compared to geological factors.
Regarding dynamic rainfall conditions, daily rainfall was identified as more important than antecedent rainfall factors (i.e., 7-day and 15-day). This is consistent with the understanding that most landslides in Italy are classified as shallow, primarily triggered by short, intense rainstorms. Within these factors, soil moisture was determined to have a relatively greater influence than antecedent conditions.
Distance to road and geology were found to be the least significant among the 14 selected conditioning factors. However, it should be noted that the determined importance of road distribution in the model predictions may have been affected by the exclusion of human-made landslide triggering events from the input landslide dataset. The geological features were not found to be as helpful in predicting landslides as other factors. This could be because landslides were observed in areas with very different types of rocks and soil, making it difficult to find a single geological cause. It was suggested that separating the geology data into smaller groups and considering the effects of weathering might make the predictions better.
Although the removal of conditioning factors with null predictive value has been suggested in some studies [
76,
77], all 14 landslide conditioning factors in this study were found to have positive predictive capability. Furthermore, it has been indicated in literature (see e.g. [
78]) that the importance of a single conditioning factor like slope angle can be site-specific and dependent on the scale of analysis and selection method. Therefore, all 14 conditioning factors were utilized in the analysis for the development of the model.
5. Results and Discussion
5.1. Rainfall Events
To evaluate the susceptibility map two major rain storms with the highest number of induced landslides were selected for analysis.
The rainfall hydrogram, obtained from the procedures developed by CNR-IRPI in Perugia, is shown in
Figure 8.
The first event (R1), occurred in October 2019, brought intense rainfall and subsequent landslides to northwestern Italy, causing widespread disruption and displacement. Locations such as San Bartolomeo del Bosco, Osigli, and areas surrounding Genoa were particularly affected. The consequences included road closures, isolated communities, evacuations, and disruptions to public transportation. Infrastructure damage was also reported, with a bridge collapsing in Capriata d’Orba. In total, this rainstorm resulted in 39 natural terrain landslides and 2 fatalities, with a maximum reported rainfall of 73 millimeters.
Figure 9, besides showing the values of maximum daily rainfall, illustrates the average daily rainfall in the studied area and the antecedent 15-day cumulative average rainfall over the study area between October and late January.
The second event (R2), on 23 November 2019, impacted a broader range of areas from the northeast to the south of the studied region. This event led to significant financial and human losses, particularly in the Liguria area, resulting in 45 natural terrain landslides and directly affecting more than 120 people. As shown in
Figure 8, the maximum rainfall amounts in both events were similar, reaching close to 70 millimeters, but the cumulative average rainfall on 23 November reached higher values compared to the event in October.
5.2. Modeling Results
Using the collected data, which includes conditioning factors in the study area as well as values extracted from satellite data (including daily rainfall, soil moisture, 7 and 15 days cumulative rainfall at each point), two landslide susceptibility maps were created for the two events under review using the optimized RF model.
Figure 10 D and
Figure 11 D respectively show the predicted landslide susceptibility results for these two events. The locations of landslides triggered by each rainfall event are indicated in both figures. From the predicted landslide susceptibility and the spatial distribution of rainfall and soil moisture parameters for the two events (
Figure 9 and
Figure 10), some interesting results can be identified:
In the October event, the concentration of rainfall in the northern and central sectors of the area can be clearly observed in
Figure 9 A, and to better understand this, the points where landslides occurred on that day are marked. By comparing daily rainfall and examining the 15-day cumulative rainfall, a spatial shift in rainfall concentration from the central areas to the western and southern parts of the study area (
Figure 9 B) can be observed. This 15-day rainfall concentration in the central areas could have a very interesting correlation with the spatial distribution of soil moisture values on the day of the event (
Figure 9 C). By examining the predicted landslide susceptibility results for October 2019, a meaningful correlation can be observed where both predicted high-risk landslide zones and the concentration of the 15-day rainfall and soil moisture values are located. Soil moisture and rainfall conditions play an important role in spatially identifying landslides and can vary significantly with different rainfall events. This difference can be observed in the spatial distribution of maximum rainfall on this day and the location of landslides. On this day, the maximum rainfall was 70 mm, which occurred in the northern part of the study area, but the landslides occurred in the central areas. These results clearly emphasize the importance of considering rainfall and soil moisture factors in accurately understanding and identifying high-risk areas.
As shown in
Figure 10, on 23 November, the intensity of rainfall was almost the same as on 20 October, with both experiencing approximately 70 mm of rainfall. However, on 23 November, the spatial distribution of rainfall was concentrated in the western and southwestern regions and also affected a larger area from north to south. This difference can also be observed in the average daily rainfall presented in
Figure 8 for these two events. On this day, the average rainfall across the entire area was nearly twice the reported average rainfall for 20 October. By examining the distribution of landslides that occurred on 23 November and the spatial distribution of soil moisture and rainfall on that day, the accuracy of the model in predicting high-susceptibility areas can be recognized. Similarly, the maximum predicted susceptibility index on 23 November is higher than on 20 October (0.93). Although these calculated susceptibility values may not be quantitatively accurate, it has been proven that the proposed model can correctly consider various intensities of soil moisture and rainfall-related parameters. For 23 November, the high-susceptibility area predicted was not concentrated in a specific region but was able to identify points that were in the eastern part of the study area, where the spatial distribution of soil moisture and intense rainfall was accurately the location of landslides. The proposed spatiotemporal landslide prediction model can satisfactorily respond to rainfalls with distinct spatial distributions, which was not reported in the previous case study.
5.3. Reliability Results
To gain a better understanding of the performance of the presented model, it is essential to investigate the model’s ability to differentiate and assess the impact of hydrological factors on the predicted values of landslide susceptibility. To this end, the susceptibility difference values for each pixel in two events (
Figure 13), along with the difference values of daily rainfall (
Figure 11 A),15-day cumulative rainfall (
Figure 11 B), and soil moisture (
Figure 12), were extracted and presented.
In the eastern part of the study area, higher daily and 15-day cumulative rainfall values were observed in November compared to the October 20 event, leading to an increase in areas with a high probability of landslides (red dots) in the eastern regions. Additionally, an examination of the obtained soil moisture values for this event compared to the previous one indicates higher soil moisture in the second event, thus increasing the probability of landslides in this area. As reported in the landslide database, four landslides occurred in these areas on this day, all of which were correctly classified by the model as very high risk.
However, a noteworthy point in this area is that in high-altitude regions like Monte Carmo, located in the southeast, the measured moisture values at the peak were higher in the first event than in the second event, yet the susceptibility values measured in this area were higher in the second event. To investigate the causes of this phenomenon,
Figure 11, which respectively show the daily (
Figure 11 A) and 15-day cumulative rainfall (
Figure 11 B) in this area, can be examined. It can be observed that the cumulative rainfall in this area was approximately 240 mm higher in the second event than in the corresponding 15-day period of the first event. This demonstrates the model’s ability to consider multiple hydrological factors simultaneously in identifying areas with a high probability of landslides. However, upon examining the susceptibility values of these two events in these areas, it is noted that although the differences in susceptibility values seem significant,
Figure 13 reveals that these areas, due to their specific landscape characteristics, have a low probability of landslides.
Figure 12.
For each pixel in the study area and in relation to R1 and R2 events, the figure shows the soil moisture difference.
Figure 12.
For each pixel in the study area and in relation to R1 and R2 events, the figure shows the soil moisture difference.
Furthermore, on October 20, an analysis of the central and southern parts of the study area reveals that soil moisture values (
Figure 12) were more concentrated in the central areas compared to the second event (purple figure).
Additionally, by examining the reported landslides on this day and analyzing the cumulative rainfall and spatial distribution pattern of daily rainfall in this area, it can be concluded that the model was able to identify the high-probability areas in this region (yellow dots in
Figure 13).
Figure 13.
Spatial distribution of susceptibility difference evaluated in relation to the two rainfall events R1 and R2 for each pixel of study area.
Figure 13.
Spatial distribution of susceptibility difference evaluated in relation to the two rainfall events R1 and R2 for each pixel of study area.
This is a noteworthy point, as if only daily rainfall values were considered for the October 20 event, the maximum rainfall would have occurred in the northern and central regions, and without the model’s utilization of other dynamic parameters like soil moisture and 15-day cumulative rainfall, it would not have been able to produce meaningful results.
On November 23, a large number of landslides occurred in the western part of the study area. Examination of the daily rainfall distribution on this day reveals a concentration of rainfall in the western and southern parts of the study area. The extracted soil moisture values for this region also show higher values in the landslide-prone areas compared to the first event. In this region, there is a wider distribution of areas with a high probability of landslides compared to the October 20 event, indicating the successful performance of the model in identifying high-risk areas considering rainfall and moisture scenarios.
In the northwestern part of the study area, a higher probability of landslides was predicted for the November 23 event compared to October 20, despite lower soil moisture values on this day compared to October 20. This can be well explained by examining the daily rainfall distribution in this area, which shows higher rainfall values on November 23. Considering this, the model was able to predict areas with a higher risk class for this day. It is also worth mentioning that the 15-day cumulative rainfall values were higher in the second event compared to the first event.
5.4. Conclusions
Since the 1980s, hundreds of studies have contributed to the understanding of landslide susceptibility across various geological, climatic, and geographical settings. A variety of methods, both direct and indirect, have been employed, incorporating both qualitative and quantitative approaches.
In this study, a method for predicting landslide probability using the RF algorithm was presented. Significant improvements in predictive accuracy were achieved by integrating static landslide conditioning factors with high-resolution dynamic variables, such as soil moisture and both daily and cumulative rainfall derived from satellite data. A part of Po River region in Italy was analyzed, and the model, validated using two rainstorm events from 2019, successfully identified areas with a high probability of landslides by considering multiple hydrogeological factors and effectively captured the distinct characteristics of rainfall and soil moisture distribution and intensity. The predicted high probable zones are closely aligned with the observed spatial distribution of landslides, particularly during severe rainstorms.
While a general methodology for spatial and temporal landslide prediction was introduced, the results are site-specific. It is recommended that the method be recalibrated for other applications to identify the optimal configuration of parameters, especially when considering different types of landslides or dynamic variables.Future work could focus on examining the relationship between the time series of dynamic variables—rainfall and soil moisture—and landslide occurrences within specified time windows.