1. Introduction
Eutrophication, an essential manifestation of water pollution in freshwater reservoirs, results from the excessive nutrient loads that cause algal blooms [
1,
2]. The excessive algae and aquatic plant growth result in oxygen depletion, ultimately causing severe impacts on aquatic ecosystems and considerably increasing the costs related to water treatment [
3,
4,
5,
6].
Chlorophyll-a (Chla), a photosynthetic pigment in major algae groups, is widely used as a critical indicator of phytoplankton presence [
7,
8]. Because the abundance of algae can reflect the state of eutrophication, Chla is one of the most important parameters for evaluating the trophic condition of water bodies [
9,
10]. While Chla concentration has long served as a critical parameter in monitoring harmful blooms, accurately predicting Chla in reservoirs has proven to be a persistent challenge [
11,
12]. This difficulty is mainly due to the non-linear and non-stationary characteristics of Chla concentration, which are influenced by anthropogenic and hydrometeorological factors [
2].
In regions characterized by semi-arid climates, such as Northeastern Brazil, establishing an extensive network of multi-purpose artificial reservoirs has emerged as a reliable solution to confront the water scarcity challenges imposed by environmental constraints [
13,
14]. These reservoirs are notably susceptible to eutrophication due to hydroclimatic characteristics that combined favor photosynthesis and biodegradation [
15,
16], such as interannual variability of precipitation and stored volume [
17], high temperatures and evaporation rates [
18], and prolonged hydraulic retention time [
19]. Moreover, this susceptibility is further aggravated by continued anthropogenic pressure on water bodies due to internal enrichment from aquaculture practice [
3,
20], inadequate coverage of sanitation systems [
21], and a dense reservoir network [
22,
23].
Therefore, regularly monitoring Chla concentrations is crucial for implementing effective water quality management strategies to prevent further deterioration [
24]. However, traditional sampling methods are expensive, time-consuming, and impractical for many reservoirs [
25,
26]. As an attractive option, satellite remote sensing and machine learning (ML) techniques offer a cost-effective approach for monitoring Chla concentrations and their spatiotemporal variations, providing large-scale data on complex environmental systems [
27]. The Sentinel-2 constellation has been proved to be a valuable asset in monitoring inland and coastal waters once it has improved spatial resolution compared to other freely available sensing systems data, such as Landsat 8 [
28,
29].
The ML approach retrieves complex non-linear relationships within satellite data, capturing the underlying structure bonding the satellite data and the desired target variable [
30,
31]. Combining ML architectures with remote sensing data has been able to provide top-notch results in a plethora of scientific fields, such as solar irradiance forecasting [
32], mapping of mineral extraction sites [
33], forest fire mapping [
34], and crop water stress evaluation [
35].
ML and satellite data performance has also been explored for Chla monitoring inland and ocean waters. For ocean water, including coastal waters, specific Chla ML forecasting models, namely OLCI Neural Network Swarm (ONNS) and Ocean Colour 4 for MERIS (OC4ME), using Sentinel-3 satellite data, showed good performance for such a task [
36,
37]. However, other possible ML paradigms can be found for inland and coastal waters. A random forest (RF) based model was developed in [
38]. In their study, the authors used Sentinel-2 imagery to retrieve Chla concentrations for lake Chagan in China. Their proposed model provided good performance in determining the Chla concentrations and complyingwith the biological mechanism in lakes, offering robust results to seasonal changes.
In Cao et al. [
9], the authors used Landsat-8 remote sensing data together with extreme gradient boosting tree model (XGBoost) ML to determine Chla in lakes located in China. Their approach was implemented to analyze the spatiotemporal data from 2013 to 2018, and it demonstrated satisfactory performance in identifying the Chla behavior in the study location. In Hu et al. [
39], the authors developed methodologies to mitigate spectral noise in satellite data to improve the performance of ML models to estimate Chla in global oceans using remote sensing from several satellites. Their results proved that the support vector regression (SVR) was the best-performing ML approach, surpassing the traditional band-ratio models and providing reduced image noise.
The Group Method of Data Handling (GMDH) has also been applied to hydrological scenarios, including Chla estimation [
40], water quality prediction [
41], and image classification for plant diseases [
42]. However, there is a gap in the knowledge regarding the usage of this approachin modeling Chla concentration using satellite data, which the present study aims to fulfill. Additionally, the present study seeks to provide a deep insight into the performance of ML paradigms when applied to a vast area containing heterogeneous reservoirs.
Given this scenario, the potential to increase algae blooms and further degradation of these aquatic systems has increased the need to study often poorly monitored reservoirs in semi-arid regions. Therefore, this study aims to estimate Chla concentration through combined remote sensing and machine learning techniques. For this, the following specific objectives were pursued in this research:
A comprehensive investigation on several input parameters for Chla modeling, including all the 13 bands of the MSI onboard of Sentinel-2 constellation and 16 different spectral indexes.
A comprehensive analysis and characterization of all the 149 tropical reservoirs extensively spread across the state of Ceará, located in the Brazilian semi-arid region.
The usage of stepwise approach on parameters selection.
The investigation of different machine learning paradigms for modeling Chla values in heterogeneous reservoirs distributed over a vast region.
To fulfill the gap in the knowledge regarding the usage of the GMDH ML model for Chla modeling using remote sensed data and spectral indexes.
4. Discussion of the results
Mainly in tropical regions, the mechanisms that regulate phytoplankton growth require an advanced analysis due to the complexity and nonlinearity of the relationship between chlorophyll and physicochemical/environmental factors [
13,
143]. The approach presented in this study achieves this analysis using a highly heterogeneous collection of in-situ observations, investigating the performance of different ML models to determine Chla levels in 149 dams in the Brazilian state of Ceará. Although most dams were predominantly eutrophic throughout the years, different human activities and pollution sources have contributed to eutrophication processes lead to Chla spatiotemporal fluctuations and algal blooms [
17,
20,
144,
145].
We applied the forward stepwise approach to select the parameters, including bands and spectral indexes. To our knowledge, this method has not yet been applied in previous remote sensing studies. The forward stepwise approach s, significantly improving the Chla determination for the ML models. Existing literature regarding the influence of spectral bands on Chla determination can be found. In work [
146], the authors applied the SHAP analysis to explain the influence of the Sentinel-2 spectral bands for estimating Chla values. Their results elucidated that band 3, band 2, and band 8 were the top three most influential parameters for Chla determination. Bands 3 and 2 exhibited a positive correlation with the Chla, while band 8 showed a relevant negative correlation with the same parameter. A similar approach was conducted by Kim et al. [
147]. In their work, the SHAP analysis showed relevant participation in Chla prediction of red bands, i.e., bands 4, 5, and 6, and blue and green bands (
Table 1), a similar conclusion found by Ha et al. [
148].
The influence of different spectral indexes for Chla modeling was investigated in previous works. Castro et al. [
149] showed that indexes merging red and NIR bands yielded the best outcomes for determining Chla concentrations in small reservoirs. Similar conclusions were drawn by Buma and Lee [
150] and by Aubriot et al. [
151], which attested to the importance of bands within the red spectral range for Chla characterization in a lake in Chad, and for the Rio de La Plata, respectively. On the other hand, Viso-Vazquez et al. [
152], showed that the green band, i.e., band 3, promoted the highest correlation between the remote sensing data and Chla levels.
Our results show the GMDH methodology could efficiently identify the latent non-linear ties ruling input and output attributes. Moreover, the GMDH approach proved to be a more robust model for analyzing satellite imagery, being more resilient to noises and satisfactory results when handling different band information, which also can be attributed to its improved performance.
To better understand where the GMDH results lie within the literature, we compared our results with those found in previously published works. Such evaluation, however, may not be representative given the different methodologies used in different studies, models, input attributes, and the study area. Yet, comparing their evaluation metrics is still a viable approach to assess the performance of different models [
153].
Table 4 compiles the results for the GMDH model, while
Table 5 gathers the results found in the literature.
Table 5 shows that the models based on the deep learning methodology all performed remarkably well in determining Chla levels. Compared to the results found in our study, it is possible to notice that the R-squared values for references [
154,
155] are in the same range, over 90%. Nevertheless, regarding the work by Guo et al. [
154], the combined utilization of Sentinel-2 and Landsat-8 data significantly enhances the machine learning model's performance. This improvement is primarily due to the reduction in the revisit time, which subsequently minimizes the variance in the dataset. Consequently, this leads to a more robust and accurate machine-learning model. It is important to note that, although their results are slightly superior, the study location is limited to only one place.
Adding more water bodies is expected to add variance to the dataset. The model in reference [
156] was built using modeled data, including 11 times more water bodies than the previous studies. Their results proved to be slightly inferior to the other two previous works and more relatable to the ones found in this present assessment, with RMSE in the same magnitude order.
Regarding the R
2, our GMDH model showed a nearly 10% superior value. However, it must be stated that since the data used for training the ML model was simulated, it may not consider several naturally occurring situations. This would lead to a more homogeneous dataset with less variance, improving the ML model performance compared to the model proposed in our study. Another significant difference is the time window used for testing the developed model in reference [
156], which is significantly smaller than the one used in our model [
160], which reduces the dataset variance, improving the ML performance.
The DL approaches presented in
Table 5 still require a considerable amount of data to yield reliable outputs. Depending on the data availability for the study location, this characteristic may pose a major drawback. On the other hand, the GMDH paradigm can be promptly implemented with less information available and does not require extensive training datasets [
40], deeming them more flexible for different situations.
References [
157,
158] applied SVR to determine Chla. It is noticeable that although the models are the same, their methodology was disparate. Regarding reference [
157], the RMSE values were within the same order of magnitude. Besides that, the R
2 values were relatively close to each other. Contrarywise, reference [
158] performed a much broader study regarding several lakes spread across the Chinese territory. As previously mentioned, the inclusion of different lakes allows the forecasting model to generalize its results better, and thus provide more robust outcomes. Another critical difference between these two works is that the former implemented as input information just the spectral bands of the MSI onboard of Sentinel-2, while the latter used both bands and indexes. In this aspect, the present study achieved superior performance, attesting the GMDH model benefited from the inclusion of spectral indexes, which improved their Chla outcomes.
The work conducted by Aranha et al. [
13] shares the same location as the one used in this study. However, they used only a subset of 5 reservoirs of 149. In their approach, the authors fitted a regression line to their dataset using the three-band spectral index (3BSI), showing a good agreement between the index and the Chla values. A similar methodology was implemented in reference [
152], where the Toming’s index was used to fit a regression line for Chla values. These two studies implemented spectral bands to estimate Chla concentrations. By evaluating the R
2 metric values for [
13,
152], the proposed GMDH was superior over both studies, with significant improvements of 12% and 5%, respectively. Furthermore, a major difference between these two studies and the methodology presented in this work, is the data handling. The other studies used methodologies of fitting a regression line using the proposed indexes over the entire dataset, making no distinction between training and testing datasets. This is analogous to assessing our ML model's performance considering the training dataset only. Therefore, their methodologies lack generalization, being bound to a particular time and geographic location. In addition, while these approaches are considerably less complex than the proposed ML model, they provide valuable insights into Chla's behavior when analyzed using the evaluated indexes.
In the work conducted by Pompêo et al. [
159], the Sentinel Application Platform (SNAP) algorithm was evaluated for modeling the Secchi disk depth, the Chla concentration and the number of Cyanobacteria cells. The study location was the Cantareira System (CS) in the Brazilian state of São Paulo, Brazil. In their study, authors used in situ collected water samples from 6 reservoirs in the CS as ground truth for the evaluated water quality parameters. These data were latter compared with the Case 2 Regional Coast Color (C2RCC) atmospheric correction algorithm. It is a machine learning paradigm based on neural networks trained to reproduce top-of-atmosphere reflectance [
161]. The results of their study showed a good correlation between the modeled data and the real collected samples for the Chla, achieving an RMSE value of 2.3 μg/L and R
2 of 0.75.
A direct comparison of these results with the ones found in our study shows an R-squared value 16% superior for the GMDH approach, while the C2RCC had better RMSE values. The reasoning behind this discrepancy for the error metric is twofold. Firstly, even though both studies were conducted in Brazilian territory, the state of São Paulo is characterized with having a subtropical and tropical climate type [
162]. Such climatic configuration is much less prone to dry seasons compared to the studied semi-arid region of the Ceará state. This consistency leads to less fluctuation in Chla levels. Consequently, this could decrease the model’s variance, thus enhancing the model’s performance. Secondly, the work presented by Pompêo et al. used significantly fewer reservoirs compared to the present work. As previously mentioned, the reduced number of reservoirs hinders the model capacity to generalize unknown data while diminishing the dataset's variance, leading to improved error outcomes. However, despite improved error, the C2RCC model is a less robust approach when compared to the GMDH algorithm, as evidenced by comparing the R
2 metric of both models.
It is important to disclose that our study was conducted in a tropical semi-arid region, considering 149 dams, considerably larger than any of the presented literature. In addition, a broader investigation of the spectral indexes for the Sentinel-2 constellation was assessed compared to the references in
Table 5. This allows us to gather further insight into the impact of the spatiotemporal influence of both bands and indexes on the Chla prediction. From the results for parameter selection in
Figure 4 and
Figure 5, we observe that bands 8 and 11 do not compose the selected bands set. However, they are still present in the form of indexes. Moreover, a careful assessment of the GMDH results (available at Supplementary Spreadsheet 1) showcases the relevant contribution of bands 3, 4, 5, 7, 8, and 11 to the Chla prediction. This indicates that the predictive models benefit from a higher spatial resolution, as well as from green, red and infra-red bands composing the indexes, according to the literature [
13,
29,
64].
5. Conclusions
This study evaluated several input parameters for Chla modeling in a vast spatial coverage of 149 freshwater reservoirs spanning a semi-arid tropical region across the Brazilian state of Ceará. This evaluation was conducted based on satellite remotely sensed data and ground-truth Chla concentration measurements that reflected the temporal and spatial distribution, notably impacted by interannual rainfall variability. To this end, we investigated the performance of several ML approaches, namely the k-nearest neighbours, random forests, extreme gradient boosting, the least absolute shrinkage and selection operator, support vector machine, and GMDH.
Forward stepwise parameter selection was implemented to determine the best input attribute selection among the 13 bands from the Sentinel-2 constellation and 16 spectral indexes derived from such bands. The usage of this methodology applied to the remote sensing field is, to the best knowledge of the authors, new and proven to improve the outcomes of the investigated Chla forecasting models.
Figure 7 elucidates the performance of the GMDH model, showing that the model under investigation yielded excellent values for the Chla concentrations, achieving R
2 value greater than 90%.
It is important to emphasize that our methodology ensures a clear separation between training and testing data, unlike some studies found in the literature. The approach is substantial for applying ML models, as a way to promote more robust results and capacity to generalize unseen data. This approach is especially crucial for a large dataset such as the one used in the present work, comprising 149 reservoirs in a semi-arid region similar in size to England, where the spatiotemporal variation of the data is substantial. Regarding the spectral bands and indexes, we showed that the Chla modeling benefited primarily from red, NIR, and green bands.
Extensive comparison with the literature found studies showed that the used models in this work offer competitive results. It is important to note that such a comparison may be misleading due to the disparate methodologies and study locations, directly influencing the ML outcomes.
In future works, the addition of more indexes, as well as the merge of Sentinel-2 data with Landsat-8 could be investigated. Including more spectral indexes and the Landsat-8 MODIS data would provide more spatiotemporal information and reduce data variance due to the finer temporal resolution. Furthermore, implementing atmosphere correction preprocessing could also benefit the predictive paradigms, as it would reduce data noise, diminish the error variance, and improve the forecasting of Chla concentration.
Author Contributions
Conceptualization, I.E.L.N. and P.A.C.R.; methodology, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; software, P.A.C.R.; validation, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; formal analysis, I.E.L.N. and P.A.C.R.; investigation, I.E.L.N., P.A.C.R., F.A.S.F., J.V.G.T., and B.G.; resources, I.E.L.N. and P.A.C.R.; data curation, I.E.L.N. and P.A.C.R.; writing—original draft preparation, B.M.D.M.G., V.O.S., I.E.L.N. and P.A.C.R.; writing—review and editing, B.M.D.M.G., V.O.S., I.E.L.N., F.A.S.F., J.V.G.T., and B.G.; visualization, B.M.D.M.G., V.O.S., I.E.L.N. and P.A.C.R.; supervision, I.E.L.N., P.A.C.R., J.V.G.T. and B.G.; project administration, I.E.L.N., P.A.C.R., J.V.G.T. and B.G.; funding acquisition, B.G. and J.V.G.T. All authors have read and agreed to the published version of the manuscript.