1. Introduction
Machine learning (ML) has emerged as a crucial domain in science and technology, exerting a substantial socioeconomic–environmental influence on various aspects of human and natural systems [1,2]. ML allows us to learn from vast amounts of data and improve the predictive performance of models. However, ML often employs complex algorithms, which results in black-box models whose intricate internal processes are not readily interpretable [3,4,5]. Such opaqueness may lead stakeholders to overlook meaningful patterns or issues arising from hidden biases in the data. This hinders effective predictive resource management, particularly when it is based on large-scale ML [6]. For example, inaccurate yield mapping is a well-known problem caused by errors inherent to high data volumes and the opacity of algorithms [7].
Considerable concern has been expressed about relying on opaque models that may result in decisions that are not fully comprehended or, even worse, that violate ethical principles in terms of business and the environment or legal norms [1,8]. These risks are particularly relevant for critical decision making in real-life scenarios and for access to public benefits [9], for example, digitalization in agriculture [10] and terrestrial conservation [11]. This partly explains the relatively low adoption rate of current ML-based decision support systems in many areas. Land managers, government agencies, and companies that incorporate black-box ML models into their practices, products, and applications potentially face efficiency, safety, and trust issues [12]. Therefore, the lack of interpretability must be addressed, and improving the explainability of predictive modeling for data analysis is becoming increasingly important for mitigating these unintended risks and promoting the correct application of ML models in critical domains.
In 2018, the European Parliament implemented the General Data Protection Regulation, which establishes provisions regarding automated decision making [1]. These provisions aim to ensure that individuals have the right to receive "comprehensible explanations of the underlying reasoning" when automated decision-making processes are used. Additionally, in 2019, the European Union’s High-Level Expert Group on Artificial Intelligence (AI) introduced ethical guidelines for trustworthy AI, and one of the critical requirements is explainability [1]. This requirement has been incorporated into the proposed EU regulation known as the AI Act [13], which establishes standardized rules for AI, thus affecting ML as a subfield of AI. Similar but nonregulatory proposals exist for AI risk management in the U.S., such as the "Identifying Outputs of Generative Adversarial Networks Act" and the "National Artificial Intelligence Initiative Act of 2020" [14]. Overall, the consensus on the importance of developing practical explanation tools is growing. Meaningful explanations are critical for describing data, testing models, identifying potential biases, addressing risks, and fostering trust and collaboration between humans and their AI assistants. However, this remains an ongoing scientific challenge [5].
An optimum model should be highly accurate and easy to interpret [2]; however, despite the rising interest in such models, achieving outputs that are both interpretable and highly accurate has presented a considerable challenge [15], particularly in response to the abovementioned concerns at the management and policy scales. Consequently, the development of explanation methods for black-box models has increased in both academia and industry [16,17,18,19,20]. Explainable ML emerged in the late 2010s for prediction in different systems to better explain black-box models and comprehensively address diverse aspects of the food and agriculture sector [21,22]. Explainable ML seeks to enhance the interpretability of complex algorithms while still maintaining their accuracy. When ML models are used for decision making in particular, prioritizing interpretable predictions is more important than focusing solely on accurate predictions [9,23].
The objective of this study was to showcase the potential of explainable ML algorithms in analyzing large-scale data. Agricultural system data and models allow us to explore the biophysical, practical, and social aspects of food production [24,25]. However, black-box machine learning often identifies false relationships between components in the system, making it unsuitable for prediction and explanation [26]. It is important to address these issues in this field because agricultural production is influenced by the management decisions of growers in response to changes in our climate and environment. We specifically focused on how different tree-based ML models can uncover novel patterns of soil organic matter in data from Korea's field cropland, compiled on a national scale. Here, we targeted soil organic matter, as soils are integral to food production and have the potential to mitigate greenhouse gas emissions [27], while the soil pools are the most vulnerable to land degradation and climate change [17,18] and are constrained by various social, economic, and political factors [28]. This approach highlights the dominant controls of soil organic matter across fields in which its distribution interacts with the environmental state and the sociocultural matrix [29,30].
4. Discussion
Based on the results of the model performance analysis, we found that GB yielded the highest accuracy among the models evaluated, followed by RF and then DT. The precipitation variable (RN) consistently emerged as the most influential predictor in all these models under Korean conditions. Our analysis also showed that soil organic matter levels under field crops depend on soil fertility properties, such as available K, available P, and Ca levels. In addition, analyzing the beeswarm summary plots provided a more comprehensive understanding of how other variables influenced the prediction (Figure 2). All models showed positive relationships between nutrients and soil organic matter, but the intensity and shape of these relationships varied among the models. For example, the two-dimensional PD plots illustrated the divergent interaction effects of precipitation and exchangeable K (Figure 5). The findings suggested that the relationship with organic matter is more robust when exchangeable K exceeds 1 cmol⁺ kg⁻¹ and annual precipitation surpasses 1400 mm. Thus, organic matter content is contingent upon both precipitation and K on a large scale. This assessment provides a comprehensive understanding of the varying importance levels attributed to key predictors using distinct modeling approaches. However, the impact of these features on the model is still not fully understood [48].
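The two-dimensional PD surface discussed above can be illustrated with a minimal sketch. It is not the analysis pipeline of this study: `toy_model`, its coefficients, the four data rows, and the grid values are hypothetical stand-ins chosen only so that an interaction appears above 1400 mm of precipitation and 1 cmol⁺ kg⁻¹ of K. The brute-force averaging itself is the standard definition of partial dependence.

```python
from statistics import mean


def toy_model(rain_mm, k_cmol, ph):
    """Hypothetical stand-in for a fitted model (predicted organic matter)."""
    pred = 15.0 + 0.002 * rain_mm + 1.5 * k_cmol + 0.3 * ph
    # Assumed interaction: an extra boost only when both thresholds are exceeded.
    if rain_mm > 1400 and k_cmol > 1.0:
        pred += 4.0
    return pred


def partial_dependence_2d(predict, rows, i, j, grid_i, grid_j):
    """Average prediction with features i and j forced to each grid pair,
    while all other features keep their observed values."""
    surface = []
    for vi in grid_i:
        line = []
        for vj in grid_j:
            preds = []
            for row in rows:
                x = list(row)
                x[i], x[j] = vi, vj
                preds.append(predict(*x))
            line.append(mean(preds))
        surface.append(line)
    return surface


# Hypothetical observations: (precipitation in mm, exchangeable K, pH).
rows = [(1100, 0.4, 5.8), (1250, 0.9, 6.1), (1450, 1.2, 6.4), (1600, 0.7, 5.5)]
rain_grid = [1200, 1500]
k_grid = [0.5, 1.5]

pd_surface = partial_dependence_2d(toy_model, rows, 0, 1, rain_grid, k_grid)

# Effect of raising K at high rainfall minus the same effect at low rainfall.
# A nonzero difference indicates an interaction between the two features.
interaction = (pd_surface[1][1] - pd_surface[1][0]) - (
    pd_surface[0][1] - pd_surface[0][0]
)
print(pd_surface, interaction)
```

In this toy setting, the effect of K on the averaged prediction is larger at high precipitation than at low precipitation, which is the kind of pattern a two-dimensional PD plot is meant to expose.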
The SHAP method was also employed to assess the importance of variables at a specific local site (data not shown). The results indicated that, at this site, pH was more important than RN, which had been identified as the most influential factor controlling global model behavior. Only the DT model revealed a conspicuous stepwise linkage, discernible at certain values of precipitation (1400 mm), available P (600 mg kg⁻¹), and K (1 cmol⁺ kg⁻¹) (Figure 4). This PD analysis underscores the nuances in how different models represent the intricate interplay between soil variables and organic matter. This insight provides valuable clarity regarding these two variables' complex interplay and collective impact on soil organic matter. Up to this point, our focus was on elucidating the overarching behavior of global models, aiming to comprehend the insights derived from the data. However, this approach falls short in explaining the nuances of local model behavior, a critical aspect in discerning the factors deemed important by the models when predicting values for specific instances.
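The contrast between global and local importance can be sketched with exact Shapley values on a small example. This is an illustration under stated assumptions, not the SHAP tree algorithm used for the tree-based models: `toy_model`, the background rows, and the local site are hypothetical, and the values are computed by enumerating all feature subsets, which is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial
from statistics import mean


def toy_model(x):
    """Hypothetical stand-in for a fitted model; x = (rain_mm, k_cmol, ph)."""
    rain_mm, k_cmol, ph = x
    pred = 10.0 + 0.01 * rain_mm + 1.0 * k_cmol
    if ph < 5.0:  # assumed penalty at strongly acidic sites
        pred -= 4.0
    return pred


def value(predict, x, background, subset):
    """Mean prediction when features in `subset` take the values of x
    and the remaining features are drawn from the background rows."""
    preds = []
    for b in background:
        z = [x[f] if f in subset else b[f] for f in range(len(x))]
        preds.append(predict(z))
    return mean(preds)


def shapley_values(predict, x, background):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phi = [0.0] * n
    for f in range(n):
        others = [g for g in range(n) if g != f]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                gain = value(predict, x, background, s | {f}) - value(
                    predict, x, background, s
                )
                phi[f] += weight * gain
    return phi


# Hypothetical background data: (precipitation in mm, exchangeable K, pH).
background = [(1000, 0.5, 6.0), (1200, 0.8, 6.2), (1400, 1.0, 5.8), (1600, 1.2, 4.8)]

# Global importance: mean absolute Shapley value per feature over all rows.
per_row = [shapley_values(toy_model, row, background) for row in background]
global_importance = [mean(abs(p[f]) for p in per_row) for f in range(3)]

# Local explanation for one hypothetical acidic site with average rainfall.
local_site = (1300, 0.9, 4.5)
local_phi = shapley_values(toy_model, local_site, background)
print(global_importance, local_phi)
```

In this toy setting, precipitation has the largest mean absolute contribution across the data, whereas pH dominates the explanation for the single acidic site, mirroring the kind of divergence between global and local behavior described above.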
Tree-based modeling can lead to the reliable prediction of organic matter content in soil and the identification of vital environmental factors that affect organic matter content. This knowledge is essential for managing the soil’s health and the sustainability of cropping systems, and for fine-tuning fertilizer and water use management. Moreover, when addressing agricultural challenges, especially with adopting alternative production systems [49,50], the results can support the selection of a farm location to improve not only soil organic matter and fertility conditions but also the marketability and profitability of crop harvests. For example, in Kenya, agricultural practices include the use of fertilizers, pesticides, and irrigation to enhance soil organic matter, potentially producing premium-market-priced organic products [51]. When premium prices are available for organic produce, the organic system yields significantly higher net returns than conventionally managed systems, achieving a gross margin that is 1.3 to 4.1 times higher. Additionally, intercropping various crops enhanced overall productivity and profitability [51]. Thus, these results provide insights into the economic implications of choosing an alternative farming system based on soil organic matter levels and related conditions. Several viable management approaches for enhancing soil organic matter content, including soil amendment, may offer additional means of boosting agricultural productivity. However, the economic evaluation of each of these methods has been limited, and a gap exists in the comprehensive evaluation of their socioeconomic impacts, necessitating further research in this area. Petersen and Hoyle [52] modeled the benefits of soil organic carbon, mainly focusing on the increased availability of nitrogen and increased plant-available water-holding capacity (PAWC). The value of soil organic carbon is estimated to be between AUD 7.1 and 8.7 Mg⁻¹ ha⁻¹ annually [134]. This valuation includes approximately 75% for carbon sequestration and smaller proportions for productivity improvements. The enhancements in PAWC (~5%) and nitrogen replacement value (~20%) contribute to higher agricultural productivity. An increased PAWC allows for increased water retention in the soil, and increased nitrogen availability supports healthier plant growth. Higher soil quality and profit also give higher-quality land a higher market value, providing an additional incentive to adopt a soil-conserving crop production system.
Mikhailova et al. [53] evaluated the monetary value of soil organic carbon stocks in the U.S., considering various factors such as soil order, depth, and geographic region. They estimated the total value of soil organic carbon storage to range from USD 4.64 trillion to USD 23.1 trillion. This valuation highlights the critical role of soil organic carbon management in delivering environmental and economic benefits. Similarly, Dube et al. [54] offered a detailed financial analysis of ecosystem services from healthy soils in Vermont, highlighting benefits such as increased carbon storage (USD 19 acre⁻¹ year⁻¹ in climate mitigation), reduced phosphorus losses (USD 8 acre⁻¹ year⁻¹ in water quality), erosion control (USD 2 acre⁻¹ year⁻¹ in waterway damage reduction), and enhanced water retention (USD 2 acre⁻¹ year⁻¹ in flood damage reduction), cumulatively valued at over USD 25 million annually. This emphasizes the economic importance of soil health investment and preservation. Hacisalihoglu et al. [55] used the “market value of soil” method to calculate the cost of soil erosion, considering nutrient loss and fertilizer market prices. They estimated an average economic loss of USD 59.54 per hectare per year in pasture lands and USD 102.36 in agricultural lands in Turkey due to soil erosion. Erosion tends to remove the nutrient- and organic-matter-rich topsoil, which provides nutrients and sustains soil structure and moisture; its loss diminishes soil fertility and affects economic values by depleting a crucial soil component. Additionally, Kane et al. [56] found that counties with higher soil organic matter levels had increased yields, lower yield losses, and lower crop insurance payout rates. A 1% increase in soil organic matter corresponded to a yield boost of 2.2 ± 0.33 Mg ha⁻¹ and a notable reduction of 36 ± 4.76% in the average proportion of liabilities paid. Sparling et al. [57] quantified the monetary value of soil organic matter in enhancing crop production in New Zealand soils. This value was determined by estimating the worth of dairy milk solids, derived from a computer simulation modeling the yield of dry pasture matter and the accumulation of organic matter. The findings revealed that soils with lower organic matter levels yielded between 8.5 and 47.7 kg fewer milk solids per hectare annually, translating to a financial impact of NZD 27 to NZD 150 per hectare. Over recovery periods of 36, 90, and 125 years, the cumulative loss per hectare at Pukekohe, due to reduced productivity, was estimated at NZD 1239 with a 3.5% discount factor and NZD 772 with a 10% discount.
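The sensitivity of such cumulative losses to the discount rate can be shown with a simple present-value sketch. It assumes a constant annual loss over the recovery period, which is a simplification introduced here for illustration; the loss of NZD 27 per hectare per year and the 36-year horizon are used only as example inputs, and the sketch is not intended to reproduce the estimates reported in [57].

```python
def present_value(annual_loss, years, rate):
    """Discounted sum of a constant annual loss:
    sum over t = 1..years of annual_loss / (1 + rate)**t."""
    return sum(annual_loss / (1.0 + rate) ** t for t in range(1, years + 1))


# Example inputs (assumed constant loss; see the caveat in the text above).
annual_loss_nzd = 27.0
recovery_years = 36

for rate in (0.035, 0.10):
    pv = present_value(annual_loss_nzd, recovery_years, rate)
    print(f"discount rate {rate:.1%}: present value NZD {pv:.0f} per hectare")
```

A higher discount rate gives less weight to losses far in the future, which is why the same stream of losses is valued lower at 10% than at 3.5%.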
Finally, Fan et al. [58] showed that field practices with varying organic matter inputs could affect the total ecosystem service valuation in organic cereal crop production systems. This impact likely stems from altered soil properties due to long-term diverse field management. The authors estimated the economic value of ecosystem services in these systems under different management strategies, finding that the economic values ranged from USD 1492 to USD 1969 per hectare per year. Reyes and Elias [59] demonstrated that drought and excess precipitation were the primary causes of crop losses in the U.S. from 2001 to 2016, leading to over USD 440 billion in economic damage. These studies emphasize the importance of understanding and predicting environmental factors in agriculture. Therefore, tree-based modeling for predicting organic matter in soil and identifying key factors can help improve soil health and contribute to more sustainable and profitable agricultural practices.