1. Introduction
Climate change has led to a recurrence of hydrological extremes such as droughts and floods and future projections reveal an increase of such events (IPCC:2014, 2014; Ziervogel et al., 2014). Mishra & Singh, (2010) reported a number of droughts in Africa, Europe, Asia, Australia and North America dating as far back as 1890. Wanders et al.,(2015) projected an increase in drought duration and severity over 27% of the world, this includes; most parts of South America, Southern Africa and the Mediterranean. Most parts of Southern Asia, have been experiencing accelerated droughts due to rising temperatures in past few decades (Dilawar et al., 2021; Prodhan et al., 2022). In Uganda, 2 465 human lives were lost due to drought, this was globally the highest mortality due to drought in 2022. Climate change has also led to a number of floods at a global scale which resulted to 7 954 human lives lost in 2022 (Centre for Research on the Epidemiology of Disasters, 2023) this number exceeds the global average from 2002 to 2021. Countries with the highest mortality due to floods in 2022 include; India at 2035, Pakistan at 1739, Nigeria at 608 and South Africa at 544.
The impacts of these hydrological extremes in an arid to a semi-arid country like South Africa are significant. For example, the droughts experienced in most parts of the country in the hydrological year 2015 to 2016 led to negative socio-economic impacts. The agricultural industry contracted by 6.5 % while the electricity, gas and water industry contracted by 2.8 % in the first quarter of 2016 (Schreiner et al., 2018). Various parts of the country experienced water restrictions due to the droughts. This drought in the Western Karoo region of the Eastern Cape were incurred until early 2020 (Archer et al., 2022). The multi-year drought in the Eastern Cape resulted into low groundwater levels and subsequently dried up some boreholes (Holmes et al., 2022). On the contrary, the neighbouring province, Kwa Zulu Natal (KZN) has been experiencing frequent floods, like in April 2019 (Bopape et al., 2021; Olanrewaju & Reddy, 2022). Recently, in April 2022, KZN experienced one of the major floods in the country resulting in over 400 lives lost over the disastrous floods (Madzivhandila, Thanyani S, Maserumule, 2022). What is evident in literature is that disasters related to climate change are on a steep rise but the adaptive capacity towards these extreme seems to be lacking (Chandrasekara et al., 2021; Hussain et al., 2020; Quesada-Román et al., 2020). This gap, coupled with freshwater scarcity thus exacerbates the importance of understanding the interplay between climate and water resources, such knowledge is essential for the sustenance of water resources.
Freshwater is the most widely used form of water resources. Approximately 97% of the Earth’s total water is available as seawater which is saline and freshwater makes up about 3% (Gude & Nirmalakhandan, 2009; Liu, 2011). Approximately 69% of the freshwater is available in glaciers, polar caps and in the atmosphere. Groundwater proportion amounts to approximately 30% of the freshwater. While 0.3% is available as surface water in rivers, lakes and dams. The growing number of extreme climate events and population growth have further strained surface water resources, making it an unreliable source. Groundwater thus, plays a vital role in an arid to semi-arid region like South Africa. A study by the Department of Water Affairs (DWA), (2010) revealed that an additional 5 500 million m3 per annum of groundwater storage was available for use by small towns, mines, villages and individuals. Moreover, this groundwater potential could be increased by recharging aquifers during wet periods and preserving groundwater supplies for use during dry/drought periods (Pietersen et al., 2011). Pietersen et al. (2011) further showed that over 80% of rural communities in North West and Kwa-Zulu Natal provinces of South Africa get their water from groundwater sources and the same applies to over 50% communities in the Eastern Cape. In terms of urban areas, City of Tshwane combines water from boreholes with surface water in its bulk distribution system. Some towns such as De Aar rely solely on groundwater sources for its water supply.
Groundwater availability is affected by a wide range of factors which include; land use land cover (LULC), baseflow index, which refers to the proportion of baseflow to the total streamflow (Bloomfield et al., 2009), soil type, catchment area, catchment slope, precipitation, evapotranspiration and catchment geology (Bloomfield et al., 2009; Zomlot et al., 2015). Precipitation is by far the climatic parameter that closely relates to groundwater availability and water resources in general, serving as a major input into the hydrological cycle. In South Africa, the most available form of precipitation is rainfall. Rainfall, according to a study conducted by Mohan et al., (2018) was found to be one of the major factors affecting groundwater recharge. A similar study in USA also found that 80% of recharge variation is explained by Mean Annual Precipitation (Keese et al., 2005). Similarly findings by Sun & Cornish (2005) reveal that variations in recharge can primarily be explained by the climatic factor compared to land-use changes. The studies outlined above illustrate the importance of rainfall events in groundwater recharge. However, a study conducted in South Africa indicated that the lag time between rainfall events and groundwater level responses is less understood (Xu & Beekman, 2019). Similarly Qi et al., (2018) and Condon et al., (2021) highlight a knowledge gap on the direct response of groundwater levels to precipitation. The lag time is important for groundwater level prediction purposes and in developing the early warning systems.
One of the major challenges incurred in understanding the lag between rainfall events and groundwater level responses is that groundwater levels are often not available in real-time (Van Loon et al., 2017). There is a lack of data on the actual use of groundwater resources, and the state of aquifers is unknown in many parts of the world. Modellers are faced with dispersed data or data scarcity, especially in arid to semi-arid areas. Data such as aquifer geometry or hydraulic parameters are often unknown (Alfaro et al., 2017; Oke & Fourie, 2017). Monitoring networks and consistent in situ groundwater measurements are unavailable in some places (Sahoo et al., 2018). There are inaccuracies in recharge and pumping estimates (Castellazzi et al., 2018).
Nowadays, models are used to provide a cost-effective means of representing a system and aid in understanding the interplay between processes involved in a system. Various groundwater quantity models have been used in the past to enhance management of groundwater resources. The models include; physically-based numerical models such as MODFLOW (Lyazidi et al., 2020; Ostad-Ali-Askari et al., 2019), data-driven models employing artificial intelligence (Ibrahem Ahmed Osman et al., 2021; Sahoo et al., 2018) and a hybrid of the two (Malekzadeh et al., 2019; Rezaei et al., 2021; Zeydalinejad, 2022). In the past, physically-based numerical models were extensively employed in modelling and predicting groundwater levels (Ahmadi et al., 2022; Hussein et al., 2020). The physically-based numerical models however, are often time-consuming, require extensive amount of input data that are often not available and require some level of expertise in measuring the parameters used in the models(Sharafati et al., 2020a). The growing availability of big data, through remote sensing and Internet of Things (IoT) have made data-driven modelling such as machine learning (ML) a favourable modelling tool. Machine learning is a field of artificial intelligence that employs algorithms to learn complex patterns from data and able to predict unobserved data (Aderemi et al., 2022; Murdoch et al., 2019). According to Osman et al.,( 2022) ML models are estimated to be adequately efficient in modelling groundwater levels. Artificial Neural Networks (ANNs) and Support Vector Machines (SVM) rank amongst the widely used ML models in groundwater level modelling (Ahmadi et al., 2022; Sharafati et al., 2020b). Recently a growth in the use of ensemble machine learning techniques has been observed and one of the advantages that the ensemble techniques offer is lower computational cost (Wei et al., 2022). One of the commonly used ML ensemble techniques is Gradient Boosting (GB) algorithm. GB algorithm was successfully used in Iran for predicting monthly GWL (Sharafati et al., 2020a), In Slovenia, GB outperformed linear regression, random forest, decision tree algorithms for predicting GWL. Again in a Moroccan study where ten ML where compared in predicting groundwater withdrawals , GB outperformed all the algorithms (Ouali et al., 2023). In a South African study, GB also outperformed five ML algorithms among which ANNs and Support Vector Regresion (SVR), the dominantly used ML were included. Despite the accuracy and computational efficiency of GB , literature on the use ensemble ML techniques such as GB in predicting GWL is still lacking (Osman et al., 2022; Sharafati et al., 2020a).
Considering the growing rate of disasters related to climate change, freshwater scarcity and the gap that exist in adaptive capacity to climate change related disasters. This study leverages on the under exploited GB ML technique in predicting groundwater levels for the Upper Crocodile basin in South Africa. The aim of this study is to predict GWLs in an effort to contribute towards building adaptive and mitigation capacity to climate-related disasters. The objectives in achieving this main goal are as follows; 1) To demonstrate the effects of climate events on our groundwater resources through trend analysis. 2) To evaluate the relationships and determine lag times between input variables and the response variable using autocorrelations and cross-correlations. 3) To predict GWLs in quaternary A21D of the Upper Crocodile. A study of this nature is vital in aiding on-time decision making for groundwater resource management.
2. Materials and Methods
2.1. Study Area Description
The study area is the Upper Crocodile (West) located in North-West and Gauteng province of South Africa. Its geographic location is between the coordinates -25.678 latitude; 27.341 longitude and -25.678 latitude; 28.472 longitude. The towns covered in the Upper Crocodile (West) area are; Krugersdorp, Magaliesburg, Centurion, Midrand and Kempton Park. This basin is the second most populated basin in the country.
The Upper Crocodile is a semi-arid region. It is characterised by warm summers with average daily minimum temperatures of 10ºC to maximum temperatures of 30ºC and cold winters with minimum temperatures of 1ºC to maximum temperatures of 15ºC (DWAF, 2004) . The rainy season is in Summer which commences in October and ends in March the following year. Rainfall usually peaks in December and January. The mean annual rainfall ranges between 600 mm and 800 mm per annum (Schulze, 2012). The mean annual evaporation (MAE), exceeds the MAP, with the MAE at an average of approximately 1 600 mm per annum (DWAF, 2004).
Hydrologically, the Upper Crocodile (West) falls under the Limpopo water management area, primary drainage region A. The primary drainage regions are further broken down into secondary, tertiary and subsequently quaternary drainage regions. To be more exact, the Upper Crocodile (West) is in the secondary drainage region A2 and tertiary drainage A21. For this study, the quaternary drainage regions lying upstream of the Hartbeespoort dam were considered which makes up approximately 4 100 km
2 (i.e. quaternary drainage A21A to A21H). The main rivers in the Upper Crocodile are the Crocodile, Magalies, Jukskei and Hennops. The main dams are the Hartbeespoort dam in quaternary A21H and the Rietvleidam in A21A.
Figure 1 below displays the study area.
The main source of water supply in the study area is the Vaal dam, which is transferred through the Rand Water bulk distribution system from the Upper Vaal Water Management Area. The other important source of water supply in the study area is groundwater from dolomitic aquifers which are mainly found in the quaternaries; A21A (North East of Johannesburg), quaternary A21B (near Centurion) and quaternary A21D (near Krugersdorp). Dolomite-based aquifers (karst aquifers) are among the most prolific water-bearing formations due to their soluble nature (Abiye et al., 2015). Groundwater from these karst aquifers is used in the City of Tshwane, North East of Johannesburg and Krugersdorp area for agricultural, domestic and industrial use (DWAF, 2004). Projections reveal more potential for usage of this groundwater resource (Meyer, 2014), which may be beneficial for this rapidly developing area.
The Upper Crocodile is characterised by karst aquifers which are generally deep and high-yielding as well as fractured aquifers which are shallow and have lower yield (Abiye et al., 2015). The specific yield of the fractured aquifers in the study area ranges between 0.01 L/s and 0.98 L/s while the karst aquifers’ is in the range of 15 L/s to 124 L/s. This makes karst aquifers a vital source of water in the study area. Approximately 10 % of the City of Tshwane’s water supply is from these Karst aquifers. Recharge to the karst aquifers in Gauteng ranges between 7% to 15% of the Mean Annual Precipitation (MAP) (Meyer, 2014) and this could be attributed to the Mean Annual Evaporation (MAE) which far exceeds the MAP. Borehole depths in the karst aquifers reach 250 m depth and water table level can go below 10 m to 50 m (DWAF, 2004). The economic activities taking place in the Upper Crocodile (West) are mainly industrial, mining and residential which contribute a significant amount of the country’s Gross Domestic Product (GDP) (DWAF, 2004). This study area was chosen because of groundwater data availability and the economic importance of this area in South Africa. Due to data availability and the high spatial-temporal variability of the parameters under study, this study focused on quaternary A21D of the Upper Crocodile. The total area of A21D is 371.54 km2.
2.2. Data sources and acquisition
Daily rainfall data were obtained from the South African Weather Service (SAWS). The station that was selected for data analysis was 0512746 7. The choice of this stations was informed by historical data availability and the proximity of the rainfall stations to shallow groundwater monitoring stations. The station is located at latitude; -25.9436 and longitude; 27.9188, as shown in
Figure 1.
Groundwater data were obtained from the Department of Water and Sanitation (DWS) through the National Ground Water Archive, available online. The stations that were selected for analysis in this study are shown in
Table 1. Most of these stations are shallow and were selected based on data availability. Also, the stations unlike most stations had daily groundwater levels in some instances.
Raw hydrometeorological data very rarely come complete and ready to use. Especially in situ data, as a result cleaning and processing is required. One of the major challenges incurred with raw time series data is inconsistent frequency of data collection, handling blanks and the data are often big to detect the errors using human eye. This particular groundwater data set came with inconsistent frequencies, the frequencies were first fixed to a monthly frequency using the resample function in Pandas. Pandas is a data analysis tool built in the Python Programming language. After fixing the time frequency, the missing groundwater levels were filled in using interpolation.
2.3. Trend analysis
The Mann-Kendall test was applied in studying trends for data obtained from eight groundwater stations.
Figure 1 shows the location of the groundwater stations in relation to one another and the rainfall station. The Mann-Kendall test is a non-parametric test used to identify trends in time series data (Mann, 1945). The test has successfully been used in analysing hydro-meteorological datasets such as climate, environmental parameters, streamflow and groundwater levels in the past (Alhaji et al., 2018; Gyamfi et al., 2016; Mathivha et al., 2021). One of the advantages of this test is that it does not require data to be normally distributed (Alhaji et al., 2018), and hydrometeorological data are usually not normally distributed. The Mann-Kendall test is founded on the correlations between sequences and ranks. In this test, a test statistic S is obtained by computing the difference between subsequent data values. Equation (1) to equation (5) describe the test, x
i and x
j in the equations represent the values of a sequence where, j is greater i and n represent the length of the time series. The Mann-Kendall statistic (S) is given as (Mann, 1945) ;
where
satisfies any of the following conditions;
When n ≥ 8, the statistic S is approximately normally distributed with the mean and variance as follows:
where
is the number of groups of tied ranks and
is the number of ties of extent i. The standardized
is computed as follows;
ZMK measures the significance of the test. In this study the null hypothesis (H0) assumed that there is no trend while the alternative hypothesis (H1) assumed that there is a trend. The null H0 was rejected if at a significance interval of α = 0.05. The Mann Kendall analysis was conducted using Python.
2.4. Cross Correlation analysis
A cross-correlation function measures the strength of a linear relationship between two time series data depending on a time lag between them (Denić-Jukić et al., 2020). It assesses the similarity between a time series and a lagged version of another time series as a function of the lag (Rahmani & Fattahi, 2021). Theoretically, the cross-correlation functions are explained in equation (6) to equation (7) (Denić-Jukić et al., 2020), where
represents the input time series while
represents the output time series from a hydrological system. Equation (6) defines the covariance function between
and
, both with length
.
where
is the mean of
and
is the mean of
.
is the time interval in which the analysis is carried out or rather the total number of correlation coefficients obtained from the analysis and
is the number of time lags. Equation (7) defines the cross-correlation function between time series
and
.
where
is the standard deviation of
and
is the standard deviation of
.
The application of cross-correlations play a vital role in hydrology and this is evident in, Rahmani & Fattahi (2021); Seo et al., (2019); Valois et al., (2020). What is common in the three studies is that the studies cross-correlated climatic parameters with the availability of water resources, with Valois et al., (2020) focusing on precipitation and groundwater recharge which is closely related to this study. While Seo et al., (2019) demonstrated the importance of pre-proccessing climate data through using cross-correlations. In the study it was deduced that precipitation and temperature cross-correlations aid in improving hydrologic simulations. One of the vital features of cross-correlation analysis is its ability to express the interdepence of parameters, a feature that is helpful in selecting input parameters for models that are data-driven. In this study we cross-correlated rainfall with groundwater levels, for the study period 2011 to 2020 and the data were analysed at a monthly time step . Cross-correlations between rainfall from rainfall station 0512746 7 and the eight groundwater stations selected for this study were computed, the highest possible lags for this analysis was 117 lags .
2.5. Autocorrelation Analysis
Groundwater level fluctuations are greatly influenced by historical groundwater level (Wei et al., 2022), thus knowledge on how the previous month’s groundwater levels affect the succeeding month’s groundwater levels is essential. Autocorrelation analysis were thus conducted in this study to determine the interdependence of succeeding groundwater levels and to determine the optimal lag between succeeding records. An autocorrelation function is referred to the measure of strength of linear relationship among succeeding values in a time series depending on the lag time between the values (Denić-Jukić et al., 2020). The function is represented as follows:
In this study an autocorrelation analysis was done on historical groundwater levels for the study period 2011 to 2020.
2.6. Gradient Boosting Regression
Gradient boosting (GB) regression is a machine learning technique that uses ensemble trees by stacking them additively to provide final predictions (Sharafati et al., 2020b). It sequentially adds predictors to an ensemble with each predictor correcting its predecessor (Géron, 2019). A GB model consists of; the predictor and response variables, a base learner, a differential loss function and a number of iteration trees. In each iteration; a negative gradient is calculated, the base learner function is fitted to the negative gradient, then the base learner function is trained and the best gradient is found.
GradientBoostingRegressor from Scikit-learn was used and employed on Python. The input variables for the predictive model were, antecedent groundwater levels and rainfall. Rainfall and antecedent groundwater level are the most used input variables for predicting GWLs using ML models (Ahmadi et al., 2022). Lag times for the input variables were generated using cross-correlations for rainfall and autocorrelations for antecedent groundwater levels. The predictive model was then generated for the period 2011 to 2020. Training data for the model were from January 2011 to April 2018 and validation data were from May 2018 to September 2020.
2.7. Performance Criteria
The performance of the model was evaluated using the coefficient of determination (R
2) and the Mean Squared Error (MSE). The expressions for R
2 and MSE are given in (9) and (10) respectively.
where;
Oi = measured data, Si = predicted data, = mean of measured data, = mean of predicted data, n = number of observations.
R2 describes the proportion of the variance in measured data explained by the model, its values range between 0 and 1 with values close to 1 indicating a variance with less error and values close to 0 indicating a high error variance. While MSE measures the average of the squares of the errors. The MSE is always positive and it is 0 for predictions that are completely accurate. It captures the bias (i.e. the difference of estimated values from the actual values) variance (i.e. how far are the estimates spread out).