2.1. The Study Area
The Central-South region is a pivotal area in China, known for its abundant natural resources and advantageous geographical location. It encompasses Henan, Hubei, Hunan, Guangdong, and Hainan provinces and the Guangxi Zhuang Autonomous Region, and features a complex and diverse topography that includes plains, hills, mountains, and plateaus. The climate varies distinctly across seasons, with ample rainfall and sufficient sunlight, creating favorable conditions for agriculture and forestry. The region also has rich water resources from major rivers such as the Yangtze, Yellow, and Pearl Rivers, which not only satisfy local demands for production and domestic use but also provide surplus resources for external allocation.
In terms of industrial structure, the region is dominated by the secondary and tertiary sectors, with manufacturing and tourism being the two pillar industries. Economically, Guangdong Province, particularly the Pearl River Delta area, plays a leading role in economic development as one of the most developed provinces in China [29]. Henan, Hubei, and Hunan are predominantly focused on agriculture and heavy industry, while Guangxi and Hainan exhibit unique advantages in tourism and tropical agriculture. The GDP total and per capita GDP of the Central-South region are both at a relatively high level, making it one of the significant engines of China’s economic growth. The population is substantial, primarily Han Chinese, but also includes various ethnic minorities.
Figure 1.
Study area (omitting Taiwan because of the absence of data).
2.2. Data Sources
As shown in Table 1, we adopted an extensive data collection and processing strategy to analyze forest fire occurrences effectively. The data were organized into four primary categories: topographic, climate, vegetation, and social and human factors, each playing a crucial role in understanding and predicting forest fires.
Fire Data Utilization: In this research, the Moderate Resolution Imaging Spectroradiometer (MODIS) dataset on forest fires was employed, which includes 18,705 identified fire occurrences. This dataset is made available by the National Aeronautics and Space Administration (NASA) and can be accessed via NASA’s Earth Data portal (https://earthdata.nasa.gov/) [19]. This comprehensive dataset enabled a detailed analysis of forest fire occurrences, facilitating the development of predictive models and strategies for effective fire management.
The dataset records, for each fire, the occurrence date, the geographic coordinates (latitude and longitude), the detection confidence level, and the brightness measurement, among other pertinent attributes. From records spanning 2000 to 2019, this study retained fires in the study area with a confidence level above 80% that occurred from 2001 to 2019.
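The selection rule described above (confidence above 80%, years 2001 to 2019) can be sketched as follows. This is a minimal illustration, not the authors' processing code: the field names `acq_date` and `confidence` are assumed to follow the usual MODIS active-fire export, and the sample records are hypothetical.

```python
from datetime import date


def select_fires(records, min_confidence=80, first_year=2001, last_year=2019):
    """Keep fire records above the confidence threshold and inside the year range.

    `records` is an iterable of dicts. The keys "acq_date" (ISO date string)
    and "confidence" (0-100) are assumed field names, not taken from the paper.
    """
    selected = []
    for rec in records:
        year = date.fromisoformat(rec["acq_date"]).year
        confident = int(rec["confidence"]) > min_confidence
        if confident and first_year <= year <= last_year:
            selected.append(rec)
    return selected


# Hypothetical sample records, for illustration only.
sample = [
    {"acq_date": "2000-11-02", "confidence": "95", "latitude": 23.1, "longitude": 113.3},
    {"acq_date": "2005-03-14", "confidence": "91", "latitude": 26.4, "longitude": 111.6},
    {"acq_date": "2012-02-20", "confidence": "60", "latitude": 30.5, "longitude": 114.3},
    {"acq_date": "2019-12-30", "confidence": "81", "latitude": 19.2, "longitude": 109.7},
]
kept = select_fires(sample)
```

Clipping the records to the study-area boundary would be a separate spatial step and is omitted here.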
Topographic Data: This included details like terrain elevation and slopes, which are essential in understanding how the physical landscape influences the behavior and spread of forest fires. Topographic features significantly impact fire propagation patterns, making this data category vital for our analysis.
Climate Data: Meteorological records, such as temperature, humidity, and wind conditions, were meticulously analyzed. These climate factors are pivotal in understanding the environmental conditions that contribute to forest fires, as they directly influence both the likelihood and behavior of these events.
Vegetation Data: The type and extent of vegetation were catalogued, as they significantly affect fire vulnerability. Data on forest coverage and other vegetation types were gathered to identify areas that are more susceptible to fire incidents, due to the availability of burnable material.
Social and Human Factors: Socio-economic parameters, including demographic data, economic metrics, population density, and residential area information, were considered to assess how human activities might influence fire risks. Factors such as agricultural practices, unauthorized burnings, GDP, and special holidays were also taken into account. While lightning data, a natural trigger for wildfires, was excluded due to reliability concerns, human-related factors were incorporated indirectly through these proxies.
The methodology involved harmonizing these diverse data types into a unified, consistent dataset. This included rigorous data cleaning to resolve issues like missing values, outliers, and duplicates, ensuring data integrity and precision. The final step was data normalization, which standardized various data formats and units, making them comparable and suitable for integration into our predictive model. This step was essential in maintaining uniformity across datasets, enabling a comprehensive analysis of the multiple factors influencing forest fires.
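The text does not state which normalization was applied. As one common possibility, min-max scaling maps every variable onto the interval [0, 1] so that factors measured in different units become comparable. The sketch below assumes that choice; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] (min-max scaling, an assumed choice)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical elevation values in metres.
elevation = [120.0, 560.0, 1000.0]
scaled = min_max_normalize(elevation)
```

Z-score standardization would be an equally plausible alternative; for tree-based models such as LightGBM the choice mainly matters for the spatial statistics rather than for the classifier.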
2.3. Method
Figure 2 outlines the technical workflow of this study. The workflow integrates several datasets, each offering a different perspective on forest fires: fire incident records, land-use patterns, meteorological data, socioeconomic indicators, vegetation characteristics, and terrain information. To make these sources comparable, the data are normalized, which reduces differences in magnitude among the variables and yields a consistent basis for analysis.
Following data preparation, three spatial analysis methods are applied. Kernel density analysis identifies the areas where fire incidents are most concentrated, that is, the hotspots of forest fire occurrence. Spatial autocorrelation analysis then examines the spatial relationships between fire events and shows how these occurrences are related across the landscape. Finally, standard deviation ellipses delineate the directional trend and dispersion range of the fire distribution.
Building on these analyses, the study applies the Light Gradient Boosting Machine (LightGBM) algorithm to forecast potential forest fire risk. The predictive model integrates historical fire incidents, meteorological conditions, land-use patterns, and socioeconomic indicators to estimate the likelihood of future fire occurrences. These predictions give decision-makers the information needed to devise targeted, proactive strategies for forest fire prevention and mitigation. Taken together, the approach both describes the spatial patterns and trends of forest fire occurrence and provides a basis for informed decision-making and strategic planning in fire management.
2.3.1. Kernel Density Estimation
Kernel Density Estimation (KDE) uses a smoothing technique to estimate the shape of a data distribution, which makes it well suited to continuous data analysis. A kernel, often Gaussian, is centered on each data point, and the bandwidth controls how far the influence of each point extends. Summing the kernels produces a continuous density estimate that can be visualized as a surface [38,39]. Applied to forest fires, kernel density analysis transforms scattered fire incidents into continuous density maps, offering an intuitive visualization of their spatial distribution. The technique does not depend on prior distribution assumptions, and the analysis scale can be adjusted through the bandwidth to accommodate various distribution patterns. In forest fire management, it is instrumental in identifying high-risk areas, optimizing resource allocation, and uncovering potential factors contributing to forest fires. Consequently, it enhances the efficiency of forest fire prevention and response strategies [40].
The formula of kernel density analysis is as follows [41]:
$$f(x) = \frac{1}{nh}\sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right)$$
The term $f(x)$ denotes the kernel density estimate calculated within the specified threshold interval, indicating the estimated density of occurrences per unit area. The variable $n$ stands for the total number of forest fires occurring within this interval. The parameter $h$ represents the predetermined search radius, or bandwidth, of the kernel density estimation window, which determines the scale of smoothing applied to the data. The symbol $k$ refers to the kernel function, which weights the data points within the search radius and thereby influences the shape of the resulting density estimate. The term $(x - x_i)$ is the distance from the estimation point $x$ to the $i$-th fire point $x_i$.
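As an illustration of kernel density estimation on point data, the sketch below evaluates a Gaussian kernel density at a query location. It is a minimal example under stated assumptions: the kernel is Gaussian, coordinates are planar, and the two-dimensional normalizing constant 1/(n h^2) is used. The function name and sample coordinates are hypothetical.

```python
import math


def gaussian_kde_2d(points, query, h):
    """Gaussian kernel density estimate at `query` from 2-D `points`.

    points: list of (x, y) fire locations in planar coordinates (assumed).
    query:  (x, y) location where the density is evaluated.
    h:      bandwidth (search radius) in the same units as the coordinates.
    """
    n = len(points)
    qx, qy = query
    total = 0.0
    for px, py in points:
        squared_distance = (qx - px) ** 2 + (qy - py) ** 2
        # Standard bivariate Gaussian kernel evaluated at distance / h.
        total += math.exp(-0.5 * squared_distance / h ** 2) / (2.0 * math.pi)
    return total / (n * h ** 2)


# Hypothetical fire points: a tight cluster near the origin and one outlier.
fires = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
density_cluster = gaussian_kde_2d(fires, (0.0, 0.0), h=1.0)
density_outlier = gaussian_kde_2d(fires, (5.0, 5.0), h=1.0)
```

Evaluating the function over a regular grid gives the continuous density surface; a larger `h` produces a smoother map with less local detail.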
2.3.2. Spatial Autocorrelation Analysis
Spatial autocorrelation is a commonly used concept in geography and statistics, employed to describe the similarity or correlation between different locations in geographic space [42,43,44]. It refers to whether there is a connection or similarity between adjacent or distant locations in geographical space and is typically used to study the distribution, clustering, and variations of geographic phenomena [45]. The advantages of spatial autocorrelation analysis in forest fire studies include revealing the geographical distribution patterns of fires, assisting in resource management and allocation, providing predictions and early warnings, optimizing monitoring networks, and supporting spatial decision-making to reduce fire risks and enhance fire response effectiveness. This analytical method plays a crucial role in forest fire research and control.
The formulas are as follows [46]:
$$I = \frac{n\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\right)\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
In this equation, $I$ represents the global Moran’s I index, $n$ stands for the total number of spatial units, $w_{ij}$ denotes the spatial weight between units $i$ and $j$, $x_i$ and $x_j$ represent the values of variable $x$ for units $i$ and $j$, and $\bar{x}$ signifies the mean of variable $x$.
$$I_i = \frac{n\,(x_i - \bar{x})\sum_{j=1}^{n} w_{ij}\,(x_j - \bar{x})}{\sum_{k=1}^{n}(x_k - \bar{x})^2}$$
In this formula, $I_i$ is the local Moran’s I index of unit $i$, $n$ is the number of spatial units, $w_{ij}$ represents the spatial weight between units $i$ and $j$, $x_i$ is the value of variable $x$ for unit $i$, and $\bar{x}$ is the mean of variable $x$.
Global and local autocorrelation analyses explore spatial patterns across the entire study area and within individual localities, respectively. The local analysis classifies spatial relationships into four distinct patterns: H-H (where high-value areas cluster together), H-L (where high-value areas are surrounded by low-value ones), L-H (where low-value areas are encircled by high-value ones), and L-L (where low-value areas cluster together). This classification shows how areas of similar or dissimilar values associate with each other, revealing patterns of aggregation or dispersion. Such insights support targeted strategies in spatial planning and analysis.
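The global and local Moran's I statistics and the four cluster types can be sketched as below. This is a minimal illustration that assumes a binary adjacency weight matrix and the common formulation in which the local values sum to the global index times the total weight. The function names and the four-unit example are hypothetical, and the significance testing normally done by permutation is omitted.

```python
def _deviations(x):
    """Return each value's deviation from the mean of x."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def _spatial_lag(z, w):
    """Weighted sum of neighbouring deviations for every unit."""
    return [sum(w[i][j] * z[j] for j in range(len(z))) for i in range(len(z))]


def global_morans_i(x, w):
    """Global Moran's I for values x and an n-by-n weight matrix w."""
    n = len(x)
    z = _deviations(x)
    lag = _spatial_lag(z, w)
    total_weight = sum(sum(row) for row in w)
    cross_product = sum(z[i] * lag[i] for i in range(n))
    return n * cross_product / (total_weight * sum(v * v for v in z))


def local_morans_i(x, w):
    """Local Moran's I for every unit (no significance test)."""
    n = len(x)
    z = _deviations(x)
    lag = _spatial_lag(z, w)
    sum_sq = sum(v * v for v in z)
    return [n * z[i] * lag[i] / sum_sq for i in range(n)]


def cluster_types(x, w):
    """Label each unit H-H, H-L, L-H, or L-L from its value and its neighbours."""
    z = _deviations(x)
    lag = _spatial_lag(z, w)
    return [("H" if zi > 0 else "L") + "-" + ("H" if li > 0 else "L")
            for zi, li in zip(z, lag)]


# Hypothetical example: four units in a row, neighbours share an edge.
counts = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
global_i = global_morans_i(counts, weights)
local_i = local_morans_i(counts, weights)
labels = cluster_types(counts, weights)
```

In this example the values increase steadily along the row, so neighbouring units are similar and the global index is positive.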
2.3.3. Standard Deviation Ellipse
The standard deviation ellipse is a visualization tool used in multivariate statistical data analysis [47,48]. It constructs an ellipse with a specific shape and orientation by considering the standard deviation and covariance matrix of the data, reflecting the dispersion and correlation of data points [49]. This visualization tool is commonly employed for displaying data distributions, detecting outliers, and performing data clustering analysis. By examining the shape and orientation of the ellipse, it helps researchers gain a better understanding of the characteristics and structure of the dataset [50]. In the context of forest fires, the advantage of using standard deviation ellipses lies in their ability to visually depict the distribution of forest fire data, identify clusters of fire sources and anomalies, and provide valuable support for spatial planning and data analysis, ultimately enhancing our understanding of the spatial features of forest fires and improving risk management and response strategies.
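The formula is as follows [51]:
$$SDE_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}, \qquad SDE_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}}$$
In this equation, $SDE_x$ and $SDE_y$ represent the standard deviations of the variables $x$ and $y$, while $n$ stands for the number of observations. Additionally, $\bar{x}$ and $\bar{y}$ denote the means of variables $x$ and $y$, respectively.
$$\tan\theta = \frac{A + B}{C}, \qquad A = \sum_{i=1}^{n}\tilde{x}_i^2 - \sum_{i=1}^{n}\tilde{y}_i^2, \qquad B = \sqrt{A^2 + 4\left(\sum_{i=1}^{n}\tilde{x}_i\tilde{y}_i\right)^2}, \qquad C = 2\sum_{i=1}^{n}\tilde{x}_i\tilde{y}_i$$
Within this expression, $\tan\theta$ represents the tangent of the angle of rotation, whereas $\tilde{x}_i$ and $\tilde{y}_i$ signify the transformed coordinates of individual point $i$, that is, its deviations from the mean center ($\tilde{x}_i = x_i - \bar{x}$, $\tilde{y}_i = y_i - \bar{y}$).
$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{n}(\tilde{x}_i\cos\theta - \tilde{y}_i\sin\theta)^2}{n}}, \qquad \sigma_y = \sqrt{\frac{\sum_{i=1}^{n}(\tilde{x}_i\sin\theta + \tilde{y}_i\cos\theta)^2}{n}}$$
In this equation, $\sigma_x$ and $\sigma_y$ represent the standard deviations along the two axes of the rotated coordinate system, while $\tilde{x}_i\cos\theta - \tilde{y}_i\sin\theta$ and $\tilde{x}_i\sin\theta + \tilde{y}_i\cos\theta$ are the coordinates of individual point $i$ after rotation.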
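The ellipse parameters described in this subsection can be computed with a short routine. The sketch below follows one widely used formulation: the mean center, a rotation angle obtained from the sums of squared and cross deviations, and the standard deviations along the rotated axes. It assumes planar coordinates and the classical form without any additional scaling factor. The function name and sample points are hypothetical.

```python
import math


def standard_deviation_ellipse(xs, ys):
    """Mean center, rotation angle, and axis standard deviations of 2-D points.

    The angle is measured clockwise from the y-axis (north), in degrees,
    and falls in the range [0, 180].
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations from the mean center
    dy = [y - mean_y for y in ys]

    a_term = sum(d * d for d in dx) - sum(d * d for d in dy)
    c_term = 2.0 * sum(a * b for a, b in zip(dx, dy))
    b_term = math.sqrt(a_term ** 2 + c_term ** 2)
    # atan2 evaluates tan(theta) = (A + B) / C without dividing by zero.
    theta = math.atan2(a_term + b_term, c_term)

    cos_t, sin_t = math.cos(theta), math.sin(theta)
    sigma_x = math.sqrt(sum((a * cos_t - b * sin_t) ** 2 for a, b in zip(dx, dy)) / n)
    sigma_y = math.sqrt(sum((a * sin_t + b * cos_t) ** 2 for a, b in zip(dx, dy)) / n)
    return {
        "center": (mean_x, mean_y),
        "angle_deg": math.degrees(theta),
        "sigma_x": sigma_x,
        "sigma_y": sigma_y,
    }


# Hypothetical points lying on the diagonal y = x.
ellipse = standard_deviation_ellipse([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
```

For points on a diagonal the ellipse collapses to a line: all of the spread lies along the axis rotated 45 degrees from north, and the spread across it is zero.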
2.3.4. Light Gradient Boosting Machine
The Light Gradient Boosting Machine (LightGBM) is an efficient gradient boosting framework that utilizes tree-based learning algorithms, specially optimized for handling large datasets while maintaining high training speed and accuracy [52]. LightGBM introduces two key innovations: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), both designed to reduce computational and memory usage during training without compromising model performance. GOSS retains data points with larger gradients while downsampling others, thus reducing computational load while ensuring training accuracy. EFB reduces the dimensionality of features by bundling exclusive features (i.e., features that do not take on values simultaneously). These innovations allow LightGBM to achieve faster training speeds and lower memory consumption when processing large-scale datasets, compared to other gradient boosting methods, while maintaining or enhancing model performance [54,55,56].
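The sampling idea behind GOSS can be illustrated in a few lines. The sketch below is a simplified standalone version, not LightGBM's internal implementation: it keeps the instances with the largest absolute gradients, draws a random subset of the rest, and up-weights that subset by (1 - a) / b so that the gradient statistics stay approximately unbiased. The function name, the default rates, and the sample gradients are hypothetical.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for one GOSS sampling round.

    top_rate (a):   fraction of instances kept for having the largest |gradient|.
    other_rate (b): fraction of all instances drawn at random from the remainder.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]
    rest = order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Small-gradient instances are amplified to compensate for the downsampling.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Hypothetical gradients for ten training instances.
grads = [0.1 * i for i in range(10)]
selected = goss_sample(grads)
```

With ten instances and the default rates, the two largest-gradient instances are always kept and one of the remaining eight is drawn at random with weight 8.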
2.3.5. Evaluation Indicators
In machine learning and statistical analysis, Accuracy, Precision, Recall, F1 score, and AUC (the area under the receiver operating characteristic curve) are standard metrics for evaluating classification models. They assess different aspects of how well a model assigns data to the correct categories and together give a comprehensive view of model performance. The formulas for these metrics are as follows [19,57]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In a binary classification context, such as assessing a forest fire prediction model, True Positives (TP) are cases accurately predicted as fire incidents, while True Negatives (TN) are non-incidents correctly identified. Conversely, False Positives (FP) refer to non-incidents erroneously classified as fires, and False Negatives (FN) are actual fire incidents that the model fails to detect. These parameters are crucial in evaluating the precision and effectiveness of the forest fire prediction model, as they measure the accuracy of the model in distinguishing between actual and non-existent forest fires.
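The metrics named in this subsection follow directly from the four confusion-matrix counts. The sketch below computes Accuracy, Precision, Recall, and F1 from TP, TN, FP, and FN, and estimates AUC as the probability that a randomly chosen fire sample receives a higher score than a randomly chosen non-fire sample. The function names and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc_score(labels, scores):
    """AUC by pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small check; library implementations use a rank-based computation instead.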