1. Introduction
The need for spatially explicit results when assessing climate change impacts on species distributions has promoted the search for a deeper understanding of how abiotic factors influence species distribution patterns. This task has been facilitated by the increasing availability of high-resolution environmental and species occurrence data, namely for climatic scenarios, and of dedicated tools such as species distribution modelling techniques based on a wide array of correlative algorithms [1,2].
Species distribution models (SDMs) are widely used to predict species ranges and environmental niches, and their use has been increasing over the last two decades [1]. Correlative models are the most common, since they relate species occurrence data to environmental variables, generating maps that predict past, present or future species distributions [2,3,4].
SDMs have been used for species conservation and biodiversity management purposes, such as selecting locations for protected areas, habitat restoration actions and/or species translocations, especially in the context of global climate change [5,6,7,8,9,10,11,12]. Under climate change scenarios, this approach has been used to assess possible impacts on biodiversity [12,13], namely potential changes in species' suitable areas, from expansion [14,15] to contraction [16,17,18,19], and sometimes even extinction [20,21].
The choices made during the modelling process can significantly affect model predictive performance, and predictive results may vary greatly depending on those choices [1,22,23], so models must be fitted to their purpose and options should be carefully considered [12]. Possible sources of inaccuracy or uncertainty can arise at different steps [24,25,26,27]. These sources include the data themselves (occurrences and environmental data, including future climate change scenarios); spatial niche truncation (the use of only a fraction of the geographical and ecological range); the clamping effect (projection onto conditions outside the range used to calibrate the models); parametrization of the modelling process (variable selection and variable correlation); the use of only climatic variables; evaluation strategies; and a limited discussion of the models. Several authors have already looked into these issues, addressing different errors that can lead to inaccurate results [2,28,29,30,31,32,33,34,35,36,37,38,39].
Unreliable species occurrence data can lead to models that underestimate suitable areas [34], affecting the quality of the resulting models [5]. Occurrence data for SDMs can be gathered from various sources, such as museums and other natural history collections, bibliographies, field surveys and databases. Data coming exclusively from museums or other natural history collections can be incomplete or biased with respect to the actual range of the species, since they were probably collected in more accessible locations [24]. Conversely, collecting data from systematic field surveys can lead to the oversampling of some areas compared to others [31]. Ideally, systematic surveys should cover the species' total range [5]. Such surveys are feasible for species with small ranges but highly demanding for species with wide ranges [36,40]. Online platforms (e.g. GBIF) currently provide occurrence data commonly used to estimate climate change impacts on species distributions. However, differences in funding and data sharing between nations lead to uneven contributions, creating spatial bias due to uneven sampling effort [28,34]. In addition, data collected by the general public may contain several types of error, such as misidentification, georeferencing errors, sampling bias towards more accessible areas near cities and roads [41], and errors introduced during data storage and mobilization [28,34].
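As a minimal sketch of how such records can be screened before modelling, the following base R example (with hypothetical column names "species", "lon" and "lat", and an illustrative cell size) removes records without valid coordinates, drops duplicates, and thins the data to one record per grid cell to reduce sampling bias; real workflows would add further checks (e.g. taxonomic validation).

```r
# Minimal sketch of basic occurrence cleaning in base R (hypothetical
# column names "species", "lon", "lat"; thresholds are illustrative).
clean_occurrences <- function(occ, res = 0.0083) {  # ~30 arc-seconds in degrees
  # 1. Drop records without coordinates
  occ <- occ[!is.na(occ$lon) & !is.na(occ$lat), ]
  # 2. Drop impossible coordinates
  occ <- occ[abs(occ$lon) <= 180 & abs(occ$lat) <= 90, ]
  # 3. Remove exact duplicates
  occ <- occ[!duplicated(occ[, c("species", "lon", "lat")]), ]
  # 4. Thin to one record per grid cell to reduce sampling bias:
  #    snap coordinates to a cell index and keep the first record per cell
  cell <- paste(round(occ$lon / res), round(occ$lat / res))
  occ[!duplicated(cell), ]
}

# Example usage with a toy data frame
occ <- data.frame(species = "sp1",
                  lon = c(-8.10, -8.10, -8.101, -7.50),
                  lat = c(40.20, 40.20, 40.200, 41.00))
cleaned <- clean_occurrences(occ)
nrow(cleaned)  # duplicates and near-duplicates in the same cell are removed
```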
Future climate scenarios are based on emission and development pathways established by the Intergovernmental Panel on Climate Change (IPCC); the most recent were released in its Sixth Assessment Report (AR6) [42]. These scenarios consider different levels of greenhouse gas emissions, population growth, economic and technological development, and land use [43,44,45,46,47]. Although they are now robust projections and essential to climate change research and assessment [45], they are still scenarios and remain prone to errors and uncertainties, as do the models based on them [29,48,49].
Study area limits are critical when modelling species suitability. When data from only a restricted portion of the species' range are used, not all the abiotic conditions endured by the species may be considered, compromising the models' ability to capture the full extent of suitable areas [2,50]. Leaving out marginal areas and marginal populations may also compromise results, since those populations may be adapted to more extreme conditions [51,52]. In these situations, called spatial niche truncation, only a subset of the species' ecological niche is considered, which can lead to incorrect forecasts when projecting future suitability [53,54]. Species occurrence data should therefore be as comprehensive as possible, representing the environments and geographical areas where the species can live and disperse [5]. In fact, in studies assessing climate change effects, it may be critical to consider areas beyond the species' present range, accounting for locations that may reflect potential future environmental conditions [55]. Models cannot assess the future suitability of conditions they have never encountered; thus, it is essential to include areas with conditions that do not yet exist in the study area but may be present there in the future [32,33,39,52,55].
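As an illustration of one way to widen the calibration extent beyond the occupied range, the sketch below (assuming the sf package and a small set of hypothetical occurrence points) builds a convex hull around the records and adds a fixed-width buffer; the 100 km distance and the projected CRS are purely illustrative choices, not a prescription.

```r
library(sf)

# Hypothetical occurrence coordinates (decimal degrees, WGS84)
occ <- data.frame(lon = c(-8.1, -7.5, -6.9), lat = c(40.2, 41.0, 39.8))
pts <- st_as_sf(occ, coords = c("lon", "lat"), crs = 4326)

# Work in a projected CRS so the buffer distance is in metres
pts_m  <- st_transform(pts, 3035)          # e.g. ETRS89 / LAEA Europe
hull   <- st_convex_hull(st_union(pts_m))  # minimum convex polygon of the records
extent <- st_buffer(hull, dist = 100000)   # add a 100 km buffer (illustrative)

# 'extent' can then be used to crop environmental layers for model calibration
extent_wgs84 <- st_transform(extent, 4326)
```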
Climate variables greatly influence the spatial (and temporal) distribution of plant species. However, they are not the only variables that explain their distribution, especially when dealing with restricted areas and high-resolution data. Other environmental and abiotic variables (e.g., soil, topography, fire) are also important when modelling distributions and range shifts [35,38,56,57], and the rejection of non-climatic environmental variables should be based on variable selection methods. The inclusion of such variables might also support the identification of other restrictive factors, namely those associated with land use, since areas with steeper slopes present lower human pressure [38] and register a higher number of occurrences; alternatively, such variables may act as limiting factors themselves, like soil conditions, since a species is unlikely to establish itself on unsuitable soils even under appropriate climatic conditions [56,57,58,59]. Thus, the exclusive use of climate data can erroneously estimate a species' range, often producing overpredictions [57]. However, not all available variables should be blindly included in the model, since they may be highly correlated [60,61], sharing large amounts of information [30]. In this case, variables with indirect effects (e.g. altitude) should be discarded, while correlated variables with direct influence (e.g. temperature or precipitation) [62,63] should be retained, namely those with high biological significance for the species under analysis. This contributes to i) simplifying the interpretation of the model [64], ii) avoiding over-fitted results, and iii) eliminating crossed effects on the response curves of each variable, since inaccuracies caused by interactions with other variables persist when correlated variables are used [2,65], making it difficult to disentangle the influence of each variable [60]. This can be a severe drawback when a model is fitted on data from one area or period and projected onto another area or period with a different or unknown collinearity structure, since collinearity between environmental variables is not constant in space and time [30]. Collinearity cannot be eliminated, but it can be reduced [30]. Several methods exist to quantify it; one of the most effective is to select variables using a threshold on the correlation coefficient (e.g., |r| < 0.7) [30,60]. Ignoring environmental variables that are determinant to the species' ecology can lead to unlikely predictions of species responses to climate change (Guevara et al., 2018). Therefore, it is crucial to know the species' ecological preferences in order to select the most meaningful variables to include in the model and make it as reliable as possible [24,30,60,66,67].
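As a simple illustration of the |r| < 0.7 rule, the base R sketch below (with hypothetical predictor names) drops, from each highly correlated pair, the variable with the larger mean correlation with the remaining predictors; in practice the choice should also weigh the variables' biological meaning for the target species.

```r
# Minimal sketch of correlation-based variable filtering (base R).
# 'env' is assumed to be a data frame of candidate predictors extracted
# at the occurrence/background points; names are hypothetical.
filter_correlated <- function(env, cutoff = 0.7) {
  cm <- abs(cor(env, method = "spearman", use = "pairwise.complete.obs"))
  diag(cm) <- 0
  keep <- colnames(env)
  repeat {
    sub <- cm[keep, keep, drop = FALSE]
    if (max(sub) < cutoff) break
    # Identify the worst offending pair and drop the variable with the
    # highest overall correlation with the remaining predictors
    idx  <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    drop <- pair[which.max(colMeans(sub[, pair, drop = FALSE]))]
    keep <- setdiff(keep, drop)
  }
  keep
}

set.seed(1)
env <- data.frame(bio1 = rnorm(100))                 # e.g. annual mean temperature
env$bio6  <- env$bio1 * 0.95 + rnorm(100, sd = 0.1)  # strongly correlated with bio1
env$bio12 <- rnorm(100)                              # e.g. annual precipitation
filter_correlated(env)  # expected to keep bio12 and one of bio1/bio6
```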
There are many techniques and modelling algorithms available to perform SDMs, belonging to different categories: regression methods, such as generalized linear models (GLM), generalized additive models (GAM) and multivariate adaptive regression splines (MARS); classification methods, such as classification tree analysis (CTA) and flexible discriminant analysis (FDA); machine learning algorithms, such as random forest (RF), boosted regression trees (BRT) and maximum entropy (MaxEnt) [37,68,69]; and others, such as support vector machines (SVM) [70]. No single model is superior in all situations [70,71], so the choice of algorithm depends on the data specificities and the study objective [72].
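For concreteness, the sketch below shows what a minimal model of the regression family (a logistic GLM) looks like in R; the data are simulated and the predictor names are hypothetical, so this is only an illustration of the correlative approach, not a recommended workflow.

```r
# Minimal sketch of a correlative SDM of the regression family (GLM),
# using simulated presence/absence data and hypothetical predictor names.
set.seed(42)
n <- 200
d <- data.frame(bio1  = rnorm(n, 12, 4),    # e.g. annual mean temperature (degrees C)
                bio12 = rnorm(n, 800, 250)) # e.g. annual precipitation (mm)
# Simulated response: occurrence probability increases with temperature
p <- plogis(-3 + 0.3 * d$bio1)
d$occ <- rbinom(n, 1, p)

# Logistic GLM relating occurrence to the environmental predictors
m <- glm(occ ~ bio1 + bio12, family = binomial, data = d)
summary(m)

# Predicted habitat suitability (probability scale) for new conditions
newdata <- data.frame(bio1 = c(5, 15), bio12 = c(600, 900))
predict(m, newdata = newdata, type = "response")
```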
Evaluation strategies or performance metrics are important to assess the discriminatory capacity of a model, i.e. its ability to distinguish suitable from unsuitable conditions. There are several ways to assess model performance, such as sensitivity (the proportion of presences correctly predicted); specificity (the proportion of absences correctly identified); Cohen's kappa statistic (kappa); the true skill statistic (TSS); the correct classification rate (CCR); the area under the ROC curve (AUC); and the error rate (ER) [22,73]. The most widely used evaluation metrics are AUC and TSS [2,74], but even these have important limitations for ecological studies [74,75,76]. They are designed to reflect the trade-off between sensitivity and specificity and generally weigh the two equally [77]. The use of AUC alone can misidentify over-fitted models as well-fitting and strongly predictive [48]. The AUC value also depends on the size of the study area: if the area is large enough to encompass habitats quite different from those occupied by the species, the AUC will be higher even if the model is not particularly good, since more points with correct predictions of low suitability are included [75,77]. The same occurs with the TSS, which tends to be correlated with the AUC. In addition, TSS depends on species prevalence and may lead to misleading results [78].
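For reference, the base R sketch below computes sensitivity, specificity, TSS and Cohen's kappa from observed presences/absences and predicted suitabilities at a chosen threshold; the 0.5 cut-off and the simulated data are purely illustrative.

```r
# Threshold-dependent evaluation metrics from observed 0/1 data and
# predicted suitabilities (illustrative threshold of 0.5).
eval_metrics <- function(obs, pred, threshold = 0.5) {
  predicted <- as.integer(pred >= threshold)
  tp <- sum(obs == 1 & predicted == 1)   # presences correctly predicted
  tn <- sum(obs == 0 & predicted == 0)   # absences correctly identified
  fp <- sum(obs == 0 & predicted == 1)
  fn <- sum(obs == 1 & predicted == 0)
  n  <- tp + tn + fp + fn

  sens <- tp / (tp + fn)                 # sensitivity
  spec <- tn / (tn + fp)                 # specificity
  tss  <- sens + spec - 1                # true skill statistic

  # Cohen's kappa: observed vs. chance agreement
  po <- (tp + tn) / n
  pe <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
  kappa <- (po - pe) / (1 - pe)

  c(sensitivity = sens, specificity = spec, TSS = tss, kappa = kappa)
}

set.seed(7)
obs  <- rbinom(100, 1, 0.3)   # simulated observations
pred <- runif(100)            # simulated model suitabilities
eval_metrics(obs, pred)
```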
These common and recurrent mistakes during SDM application have led to the publication of several works that intend to standardize SDM procedures, improving their quality and reproducibility [5,6,12,64,79].
In this context, the main objective of this study is to analyze the available literature dedicated to assessing climate change effects on plant distributions based on niche modelling, in order to:
identify the data and methodologies most used in recent papers, namely those related to model calibration;
identify the most common deviations from consensual best practices and the information most often omitted from methodological descriptions;
assess how far the faults referred to above are identified and discussed;
propose new recommendations to improve SDM results, making them clearer and more comprehensive.
The analysis considers the methodologies used in recent papers, from species occurrence data to abiotic variable sources, and their implications for model accuracy and for potential reproduction by peers. The key aspects of the SDM workflow registered for each paper, from occurrence data sources to climate scenarios, evaluation metrics and gaps in the methodological description, are detailed in the Materials and Methods section.
2. Materials and Methods
This study aims to identify whether best practices are followed in recent papers assessing changes in plant species distributions under climate change scenarios based on niche modelling. The search was conducted in November 2022 in two databases, Web of Science (WOS) and Scopus, using the following Boolean search equation: ("climate change" OR "global change") AND ("model*" OR "ecological niche model*" OR "species distribution model*" OR "habitat suitability model*" OR "range shift") AND ("R software" OR "maxent" OR "Biomod*" OR "GLM" OR "average model*" OR "ensemble*") AND ("flora" OR "plant*"). The search was applied to the "Topic" field in the WOS Core Collection and to the "Article title, Abstract, Keywords" field in Scopus. Since modelling methodologies are constantly changing, a time limit was imposed, considering only scientific papers published from 2018 to 2022. Only original articles were considered; other document types, such as review articles, books or book chapters, were removed. The search followed the PRISMA guidelines [80,81], and the flow chart (Figure 1) summarizes the different steps undertaken in the current study.
Duplicate records and articles in languages other than English were removed first, as were unavailable documents. The titles and abstracts of the remaining documents were then thoroughly screened and evaluated for inclusion in the study (Figure 1). Those articles were further assessed against pre-established exclusion criteria, removing studies that were: a) not exclusively focused on terrestrial vascular plant species; b) dedicated to agricultural species and their production, such as vines, rice and corn; c) focused on invasive flora; d) set in aquatic environments or on islands; e) dedicated to the evaluation of modelling methods rather than to assessing climate change effects on species distributions; or f) lacking modelling for the future.
The screening of the databases yielded 240 documents complying with the selection criteria, from which a representative sample of 20% was randomly selected to obtain a more manageable number of articles. The resulting 48 articles were then analyzed (Appendix A).
The key aspects of the SDM elements were noted for each selected publication and assembled into a database: i) the source of species occurrence data; ii) the area analyzed; iii) the type of data (presence-only, pseudo-absence, absence data); iv) the abiotic variables; v) the variable selection procedure; vi) the algorithm(s) used for modelling; vii) the model performance metrics; viii) the use of an ensemble model; ix) the climate scenarios studied; x) the source of the climatic models (databases and GCMs); and, ultimately, xi) any missing description of the methodology used.
4. Discussion
The information about the methodology used in each work is not always clear and complete. Some parameters are described more consistently, such as the origin of the data; however, many articles fail to specify others, such as the use of absence or pseudo-absence points, ensemble modelling techniques, or even the GCMs used. The same tendency, which limits the reproducibility of the studies, has been noticed by other authors [1,6,71]. This problem has been addressed in recent literature by several authors aiming to provide guidelines/checklists for future publications [2,5,6,12]. In addition to the gaps in the description of the adopted methodologies, common and recurrent mistakes during SDM application have also been pointed out by recent studies [36,37]. Such poor modelling practices can lead to inaccurate conclusions and poor planning of conservation actions [64]. The examined studies had many similarities concerning the different elements analyzed. The target species' distribution area, whether total or partial, could have been more clearly stated. Over a third of the papers used the total range of the species, while the rest only considered a fraction. This is an important point, since models that rely on partial distributions may not be able to capture the full range of abiotic conditions in which a species can survive [2], and marginal populations can have adaptations to more extreme situations [51]. It is also essential to include conditions that do not yet exist in the study area (e.g. by using buffer zones) but will probably exist in the future, so that the model can assess their suitability. Ignoring this can lead to errors, since the model cannot make accurate projections for unknown climatic conditions [32,33]. However, this seems common in ecological modelling exercises [53,54,55]. For this reason, niche truncation and clamping can lead to incorrect predictions when projecting to future climatic conditions, since future conditions may be unavailable in the calibration area but may be suitable for the species [39,53,55]. This can result in predicting false local extinctions or extirpations and, hence, inaccurate predictions of future species suitability, especially at range margins [50]. Excluding areas under a climate that will no longer exist in the future, e.g. the northern range limit of a European species, may however not be problematic, since those conditions will no longer be present [50]. The species' range and the study area, together with the reasons for those choices, should be well specified [25], which was not always the case. Nevertheless, only one work addressed niche truncation, and only superficially.
Field surveys were the most popular data source, but models fed only with field data can suffer from some areas being over-sampled, especially when species have broad ranges [31]. Although systematically designed surveys covering major species ranges are recommended [5], systematic surveys across the entire species range and the major environmental gradients occupied can be resource-demanding, expensive and time-consuming [36,40]. On the other hand, opportunistic sampling (e.g. GBIF) can have other problems, such as species misidentification and spatially biased records due to uneven sampling effort [28,34], but the larger sample sizes of this type of data seem to compensate for these problems and to outperform systematic sampling [89,90,91]. Such biases and inaccuracies in distributional data can place heavy limitations on SDM studies and affect the quality of the final results [5]. About two-thirds of the studies used more than one source of occurrence data, from fieldwork and large databases to locations mentioned in specific studies or herbaria. This can be a good strategy, since the more information is given to the model, the better it will perform [92], and data from different sources might complement each other [89]. Also, when sample data are collected from broad geographical areas covering different environmental gradients, there is a higher chance that the environmental conditions limiting the species' distribution will be well sampled [24].
Climate variables were used in all studies, and the most common source was WorldClim, included in the large majority of the papers. The models were performed mainly at a 30 arc-second spatial resolution (approximately 1 km2), the highest resolution used. Depending on the study goal, or for small-range species, a finer spatial scale may nevertheless be needed [58]. The 30 arc-second scale is often the finest available, which limits the possibility of building finer-scale models. Coarser-scale models detect less variation in topography or soil conditions than finer-scale data, resulting in a lower ability to discern topographic and soil variation within the landscape [58].
However, non-climatic factors might also influence plant species distributions [35,56,57]. Around half of the analyzed studies used climatic variables only. Not including other environmental variables in the model can lead to overestimating habitat suitability for many plant species, both in the present and under future scenarios, since climate-based projections might include areas with unsuitable soil conditions [57]. Some of these studies highlight this fact, pointing to the issue as a limitation [93,94], while others argue that reliable data are lacking at a scale that would allow their inclusion in the model. Yet, including all climate and non-climate variables in the same model may not always be suitable [6], since these variables may be highly correlated [61] and their correlation can change through time [37], making future projections less reliable.
Indeed, variable selection is a crucial step in SDM, but one-fifth of the analyzed articles fail to mention variable selection or do not describe the method used. Some simply use all the available variables to build the models, without considering possible present and future correlations between them, even though most modelling algorithms are sensitive to high levels of correlation between variables. MaxEnt, the most used algorithm in the analyzed papers, seems capable of dealing with redundant variables, its performance being largely independent of the degree of predictor collinearity and of collinearity shift [60]; thus, removing highly correlated variables seems to have a small impact on MaxEnt model performance [60]. The articles that do not refer to variable selection used mainly MaxEnt. However, in those using other algorithms (BRT, RF, GLM, GAM, MARS and CTA), no justification is given for the absence of correlation analysis and variable selection. Correlation-based variable selection should be performed to simplify the interpretation of the model [64]. Additionally, the species' ecological preferences should be considered in order to select the most meaningful variables to include in the model [24,30,66,67,95].
Several methods are available to perform SDMs; no single one is superior in all situations [70,71], and they seem to have similar performances [92]. Nevertheless, BRT, MaxEnt and RF have been reported as the best-performing modelling algorithms, while parametric and semi-parametric regression models (like GLM and GAM) can be good choices when the number of occurrences is very low [70]. In accordance with other similar reviews [71,96], MaxEnt was by far the most used algorithm in the screened studies, as previously stated. However, the percentage of papers using this algorithm was larger in our review than in others [71,96], and it was the only algorithm used in most papers. MaxEnt is a machine-learning method [97,98], and some of its features may contribute to its popularity compared to other algorithms: it is user-friendly, even for a beginner; its outputs are easy to access and read; it is very accessible, as it can be used in open-source software or through free R packages; it does not require absence points; and it generates significant results even with a small number of spatially biased presence points, having been shown to deliver good results [2,58,70,97,98,99,100]. Despite that, in climate change assessments and future projections, it seems advisable to use more than one algorithm to produce a final model, in line with consensual best practices [5].
Most papers used more than one climate scenario and more than one time interval. The Shared Socioeconomic Pathways (SSPs) [42] are notably less used than the RCPs [88], probably because they are more recent and were unavailable when some of these works were developed. On the other hand, the scenarios provided by [87] had very limited usage, which makes sense, as more robust scenarios were available when these papers were published. RCP8.5 was the most used scenario, although it describes a situation with very high anthropogenic greenhouse gas emissions and no additional efforts to constrain them [88]; papers using this scenario also used at least one intermediate scenario. Most screened papers presented two different future time intervals, with a preference for more distant periods. This makes sense and can be helpful when the goal is to plan management actions, especially for long-living species, since adaptive and management strategies require a longer-term perspective: areas managed today must cope with the climate conditions of at least several decades ahead [101,102]. However, many species may not be able to establish themselves in places that will only become suitable in a few decades; therefore, less distant periods might also provide meaningful information about transition areas.
The analyzed studies used a wide range of GCMs, a total of 32 considering all model versions, with most articles using only one GCM for the analysis. Since GCMs are projections and prone to errors, the use of more than one GCM has been emphasized as a way to reduce uncertainty when projecting species distributions through time [29,49]. Still, more than one-third of the papers used more than one GCM, from 2 to 8. Some GCMs are used more than others: those developed by the UK Meteorological Office are the most popular, followed by those of the National Science Foundation (NSF) and National Center for Atmospheric Research (NCAR) from the United States, and the Beijing Climate Centre Climate System Model from the People's Republic of China.
When several algorithms or GCMs were used, ensemble models were often performed, although this corresponded to less than a fifth of the articles. Ensemble modelling is generally considered to yield better predictive results and to be more reliable than single models, and it is often used to reduce the uncertainty associated with model selection [1,70,71]. Still, an ensemble combining models with both good and bad predictive capacity may not result in a good final model [2]. As in other works [1], and as for other analyzed parameters, the methodology used to build the ensemble is sparsely described in the analyzed papers. Only two-thirds of the articles performing ensembles clearly stated the use of Biomod2, and only one-third described how the best models were chosen for inclusion in the ensemble, using a threshold based on AUC or TSS.
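As a concrete illustration of the threshold-then-average approach mentioned above, the following base R sketch (with simulated suitability predictions and an illustrative TSS cut-off of 0.6, not drawn from any of the reviewed papers) keeps only the models passing the evaluation threshold and combines them with a TSS-weighted mean.

```r
# Minimal sketch of a weighted ensemble of suitability predictions.
# 'preds' holds one column of predicted suitability (0-1) per single model;
# 'tss' holds each model's evaluation score. All values are simulated.
set.seed(3)
n_cells <- 1000
preds <- data.frame(maxent = runif(n_cells),
                    rf     = runif(n_cells),
                    glm    = runif(n_cells))
tss <- c(maxent = 0.72, rf = 0.65, glm = 0.48)  # illustrative scores

ensemble_mean <- function(preds, scores, cutoff = 0.6) {
  keep <- names(scores)[scores >= cutoff]        # discard poorly evaluated models
  if (length(keep) == 0) stop("no model passes the evaluation threshold")
  w <- scores[keep] / sum(scores[keep])          # weights proportional to TSS
  as.matrix(preds[, keep, drop = FALSE]) %*% w   # weighted mean per cell
}

ens <- ensemble_mean(preds, tss)
summary(as.vector(ens))
```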
The large majority of the papers used ROC/AUC to measure model performance. Although over a third of the studies used more than one metric, often in a complementary way, the use of AUC stands out. This was also observed in another study [74] and is possibly related to the threshold-independence of ROC/AUC, to its being provided by MaxEnt, and to its use in a wide range of prediction-related applications. Despite its wide use, the single use of AUC, or of any other single metric, can misidentify over-fitted models as well-fitting and strongly predictive [48]; therefore, models should be carefully evaluated by specialists to check whether they make ecological sense for the target species [103].
5. Conclusions
The current review identified 240 papers modelling plant species niches and possible future range shifts due to climate change, 48 of which were randomly selected and analyzed. Despite published standards for the use of niche models, recent studies focused on climate change still exhibit uncertainty related to inconsistent methodological decisions. Although modelling strategies and data sources are fairly consistent, the description of the methodology is sometimes missing, which hinders the reproducibility of SDM studies and increases uncertainty in the discussion of results.
Species occurrence data mostly comprised only part of the species' range and were drawn from more than one source, with field surveys being the most popular choice. All papers used climate data, while other environmental variables were used in over half of the documents.
The choice of modelling algorithm was quite homogeneous, with almost all documents using MaxEnt, often as the only algorithm. Using only one GCM was a popular choice, although it is considered best practice to use more than one; no clear preference for a particular GCM was found.
The analysis of these parameters indicates that several articles base their models on choices that may lead to inaccurate and possibly unreliable results. Defining a study area that does not include the whole of the species' natural range, leaving out areas and environments in which the species can live and whose climatic conditions might become more common in the future, was frequent, since over half of the studies only considered part of the species' range. Ignoring the species' ecological preferences when choosing the variables to use in the model, both at the outset and during variable selection, is another apparently common error, which can lead to inaccuracies in the results.
Overall, there is a need to make the information in SDM studies clearer and more comprehensive. In this paper, we emphasize that information regarding the species being studied and the modelling process is often missing. Therefore, besides the best practices referred to in the previously cited guideline papers, we consider it pertinent that future modelling studies state and include the following:
The target species' natural range;
Consideration of the total species range in the study area, including a buffer to ensure the inclusion of different environmental conditions;
A comparison between the study area and the natural range of the species, justifying the exclusion of certain areas from the model, where applicable;
The species' ecological preferences according to the bibliography, to support the variable selection;
Whatever the authors' options, a more critical appraisal of the obtained results, identifying putative constraints that may influence the final results and the points that can be improved in future studies.