1. Introduction
The swift pace of industrialization and population growth has resulted in a notable surge in the production of solid waste [
1,
2]. Globally, the annual production of solid waste ranges from 7 to 9 billion tons, with approximately 2 billion tons estimated to be municipal solid waste (MSW) [
3]. A recent study forecast that the worldwide production of MSW will reach 2.59 billion tons by 2030 and 3.4 billion tons by 2050 [
4]. Construction and demolition waste (CDW) accounts for more than 30% of the MSW generation worldwide [
5,
6], with Europe and the U.S. being responsible for 36% and 67% of the total generation volume, respectively [
7]. The volume and composition of CDW differ across regions, with China, the U.S., and Europe being the primary contributors [
8]. However, the recovery rate of CDW is not proportional to its generation volume and fluctuates significantly from 7% to 90% depending on the region [
9]. Moreover, while 75% of global CDW could be recycled, approximately 35% is disposed of in landfills [
10]. Given that the construction sector utilizes approximately 40% of the world’s raw materials [
11], produces 40% of waste [
12], and contributes 25% of global carbon dioxide emissions [
13], a low recovery rate of CDW implies that the industry lacks sustainability. When supported by a high CDW recovery rate, recycling offers myriad benefits for sustainable development across social, environmental, and economic dimensions [
14].
To achieve sustainable consumption and environmental integrity within the construction sector, stakeholders must cooperate within a circular economy framework. Such cooperation underscores the importance of acknowledging sustainable consumption and development initiatives, as well as implementing effective systems and solutions to drive efficiency and promote sustainability within the architecture, construction, and engineering industries. An example of an effective tool for such systems and solutions involves the assessment of the maximum economic and environmental benefits attainable from a structure prior to its dismantling and demolition [
15]. Efficient waste management (WM) requires precise forecasts of the waste generated, in terms of both quantity and composition [
16], which is indispensable for realizing sustainable WM practices. These endeavors will lay the groundwork for enhancing legislation pertaining to waste, conducting environmental impact assessments, assessing social and economic costs, designing WM systems, and planning the necessary infrastructure such as collection sites, recycling centers, landfills, and incinerators [
17,
18]. Furthermore, precise estimation of waste generation volume can serve as foundational data for effective WM practices, such as planning landfill capacity, implementing waste treatment levies or recycling incentives, and formulating comprehensive WM strategies [
19]. However, owing to numerous uncertainties, accurate prediction of the quantity of waste generated is difficult [
20].
Recently, owing to significant advancements in the WM domain, machine learning (ML) has emerged as an effective tool for addressing diverse challenges linked to CDW management. ML enables data-driven decision-making processes and delivers precise predictive data through the utilization of various technologies associated with data collection, processing, and information extraction [
21]. In the CDW field, numerous researchers have investigated waste generation prediction models, utilizing ML as a component of WM tools. For example, Lu et al.[
22] examined multiple linear regression (MLR) models to forecast the volume of waste generated by renovation projects carried out in Hong Kong. Lu et al.[
23] developed MLR, decision tree (DT), gray model (GM), and artificial neural network (ANN) models to predict the volume of construction waste generated within a specific region in China. Based on a deep learning model, Akanbi et al. [
24] predicted the volume of waste destined for recycling, reuse, and landfill within the context of demolition waste (DW) management. By applying random forest (RF) and gradient boosting machine (GBM) algorithms to data from 690 structures involved in demolition projects, Cha et al. [
25] developed a model for predicting the demolition waste generation (DWG) rate. Cha et al. [
26] developed a hybrid ML model that integrated principal component analysis with ANN and support vector machine (SVM) algorithms, which aimed to enhance the accuracy of predicting the DWG rate for structures undergoing demolition projects. Coskuner et al.[
27] developed a multilayer perceptron (MLP)-ANN model for predicting the CDW of the Askar landfill site in Bahrain. Gulghane et al.[
28] collected construction waste data from 134 construction sites in Nagpur, India, and developed an ML model based on decision tree (DT) and k-nearest neighbors (KNN) algorithms. Hu et al.[
29] gathered construction waste generation rate data from 206 construction sites and used them to develop prediction models based on SVM, ANN, and MLR algorithms. The ML models introduced in these studies primarily predict the overall CDW generation volume of a specific project or region. Such predictions serve practical purposes such as monitoring, data collection, and devising large-scale waste processing strategies guided by the total waste generation volume. However, enhancing the effectiveness of CDW management requires more comprehensive strategies that address the environmental impact of specific waste types, their processing expenses, and the selection of appropriate recycling methods. Because they focus on total volumes, the ML models established in previous studies leave a research gap that limits their contribution to a sustainable construction industry. Comprehensive strategies outlining environmental impact assessments, processing costs, and recycling plans for different types of CDW can provide a broader opportunity for fostering the development of a circular economy [
30]. Formulating such strategies requires an understanding of the attributes of the processed waste, its processing flow, and its categorization by characteristics and type. This understanding enables informed decisions regarding effective and efficient environmental impact assessments and recycling techniques for CDW.
With the surge in dismantling operations during the redevelopment of old buildings, the generation of DW in Korea is expected to increase sharply [
31]. Therefore, managing DW poses a significant hurdle for Korea’s sustainable development. To achieve sustainable development in the Korean construction sector, it is imperative to establish comprehensive DW management strategies. These strategies should encompass aspects such as DW recovery rates, enhanced recycling rates, landfill allocation strategies, environmental impact assessments, and considerations of environmental and social costs. Such strategies should be formulated based on a thorough understanding of the flow and volume of DW generation from older structures. Accordingly, gaining insight into the processing flow and volume of DW generated from outdated structures in Korea is crucial. Therefore, this study developed an ML-based management tool to predict the volume of DW generation, along with the quantities of recyclable and discarded or landfill building materials based on their characteristics, from old structures. By identifying the characteristics of such DW, this tool can aid decision-making processes in DW management by providing data on recycling recovery rates, recyclable DW generation volumes, and landfill waste generation to support landfill allocation plans. This study presents the following specific objectives:
Designing a model for predicting the volumes of recycling and landfill DW, taking into account the characteristics of DW generated from old structures within redevelopment zones.
Testing a variety of potential sub-prediction models by determining optimal hyperparameters (HPs) and employing different algorithms.
Analyzing the factors affecting the volumes of recycling and landfill DW generated.
Proposing an optimal ML model for forecasting the volumes of recycling and landfill DW by evaluating the performance of training, validation, and testing models.
The remainder of this paper is structured as follows:
Section 2 introduces the ML algorithms applied in this study, along with the data used in model development, the theoretical basis of the applied algorithms, model optimization methods, and model evaluation techniques. In
Section 3, the developed sub-prediction models are assessed, and the best-performing prediction model is proposed. Additionally, the factors influencing the models are analyzed using SHAP analysis.
Section 4 offers various discussions based on the key research results, while
Section 5 presents the major findings and conclusions, and also addresses the limitations of this study.
2. ML-Based Models and Application
This study examined seven ML algorithms. In waste generation studies within the WM field, the most commonly used ML algorithms are ANN, DT, KNN, RF, and SVM [
32,
33]. These algorithms typically demonstrate exceptional performance in supervised learning tasks, handling non-linear data, identifying faults in datasets, and managing heterogeneous output parameters and numerical target variables [
34]. Because these algorithms (i.e., ANN, DT, KNN, RF, and SVM) remain widely employed, they should be prioritized during the development of prediction models. In addition, the linear regression (LR) algorithm is straightforward and facilitates easy interpretation of results and is thus a recurrent choice for model development in the WM domain [
32,
34]. Ensemble algorithms, including RF, offer benefits such as improved prediction performance and better generalization compared to individual learning algorithms [
35,
36]. The GBM algorithm, which relies on boosting rather than bagging, is another commonly used ensemble algorithm. Furthermore, Al Martini et al.[
36] and Jayasinghe et al.[
38] developed ML models with excellent prediction performance using GBM. The current study utilized ANN, DT, GBM, KNN, LR, RF, and support vector regression (SVR, the regression form of SVM) algorithms, which have demonstrated remarkable performance or are commonly employed in the WM field. The features of each algorithm and methods for enhancing their performance are described below.
2.1. Artificial Neural Network
An ANN is a computing system comprising multiple layers of neurons, typically organized into input, hidden, and output layers. These networks incorporate non-linear transfer functions across the layers, thereby enabling the learning of both linear and non-linear relationships between input and output neurons. Owing to their strong fault tolerance and suitability for depicting the complex relationships between variables in multivariate systems, ANNs are frequently used in the WM field for developing AI models [
32,
39]. An ANN architecture such as the multilayer perceptron can achieve deep learning by adding hidden layers. The fundamental HPs of an ANN model are the number of hidden layers and neurons and the type of activation function. Additionally, other HPs such as the number of epochs, the learning rate, and the regularization method also need to be properly selected to improve the generalization ability of the model and reduce the training time [
40].
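As a minimal sketch of the layered structure described above, the following pure-Python forward pass sends one input vector through a single hidden layer and compares two activation functions. The weights, the layer sizes, and the function names `relu` and `forward` are illustrative assumptions, not the network configuration used in this study.

```python
import math

def relu(z):
    """Rectified linear unit, one common non-linear activation."""
    return max(0.0, z)

def forward(x, hidden_w, hidden_b, out_w, out_b, activation=relu):
    """One forward pass: inputs -> hidden layer -> single linear output neuron."""
    hidden = [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b

# Hypothetical weights for a network with 2 inputs and 2 hidden neurons.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
out_w = [2.0, 1.0]
out_b = 0.1

x = [3.0, 1.0]
y_relu = forward(x, hidden_w, hidden_b, out_w, out_b)
y_tanh = forward(x, hidden_w, hidden_b, out_w, out_b, activation=math.tanh)
```

Changing only the activation function changes the output for the same weights, which is why the activation type is treated as an HP alongside the number of layers and neurons.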
2.2. Decision Tree
As a supervised learning model for tackling classification and regression problems, the DT algorithm is used for efficiently extracting a set of rules from unfamiliar data [
34] and offers numerous benefits. For example, the algorithm produces intuitively interpretable results, incurs low computational expense, and can handle data with missing values. However, DT is vulnerable to overfitting [
32]. Constructing an effective DT model therefore requires controlling its complexity by tuning HPs such as the maximum depth or the split criterion [
41].
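The effect of the maximum depth can be illustrated with a small regression tree written from scratch. This is a sketch under stated assumptions: the data are invented, the split criterion is the total squared error, and the helper names (`sse`, `build_tree`, `predict`, `training_sse`) are hypothetical rather than taken from the study.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(xs, ys, max_depth):
    """Fit a 1-D regression tree; max_depth caps how many splits are stacked."""
    if max_depth == 0 or len(set(xs)) < 2:
        return sum(ys) / len(ys)  # leaf: mean target of the samples here
    values = sorted(set(xs))
    thresholds = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]

    def cost(t):
        # Split criterion: total squared error of the two child nodes.
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        return sse(left) + sse(right)

    t = min(thresholds, key=cost)
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (
        t,
        build_tree([x for x, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([x for x, _ in right], [y for _, y in right], max_depth - 1),
    )

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def training_sse(tree, xs, ys):
    """Squared error of the tree on the data it was fitted to."""
    return sum((y - predict(tree, x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: one input feature against a waste quantity (arbitrary units).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8]
shallow = build_tree(xs, ys, max_depth=1)  # one split, two leaves
deep = build_tree(xs, ys, max_depth=5)     # deep enough to isolate every sample
```

The deep tree reproduces every training value exactly, including the small fluctuations that are probably noise, whereas the shallow tree keeps only the main step in the data. Limiting the depth trades training accuracy for a simpler model.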
2.3. Gradient Boosting Machine
Friedman [
41] first proposed gradient boosting (GB) as an algorithm used for classification and regression tasks. As a boosting technique, GBM stands out as one of the most robust ML algorithms extensively utilized in the engineering domain [
43]. This algorithm builds a strong learner by sequentially adding weak learners to the model, each one fitted to reduce the loss remaining after the previous iterations. As a result, the bias and variance of the prediction model can be drastically reduced [
42]. As a boosting-based ensemble learning technique, GBM can enhance model efficiency and accuracy by tuning the learning rate, which scales down the contribution of each weak learner, and “n_estimators,” representing the maximum number of estimators at which boosting concludes [
44].
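The roles of the learning rate and n_estimators can be sketched with a small gradient boosting loop for squared-error loss, in which each weak learner is a one-split tree (a stump) fitted to the current residuals. The data and the function names (`fit_stump`, `fit_gbm`, `predict_gbm`, `mse`) are illustrative assumptions, not the implementation used in this study.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that minimises squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]

def fit_gbm(xs, ys, n_estimators, learning_rate):
    """Start from the mean, then add stumps fitted to the remaining residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        # The learning rate shrinks the contribution of each new stump.
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def predict_gbm(model, x):
    total = model["base"]
    for t, lm, rm in model["stumps"]:
        total += model["learning_rate"] * (lm if x <= t else rm)
    return total

def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical training data (arbitrary units).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 2.9, 4.2, 4.8, 6.1, 7.3, 7.9]
mse_0 = mse(fit_gbm(xs, ys, n_estimators=0, learning_rate=0.1), xs, ys)
mse_5 = mse(fit_gbm(xs, ys, n_estimators=5, learning_rate=0.1), xs, ys)
mse_50 = mse(fit_gbm(xs, ys, n_estimators=50, learning_rate=0.1), xs, ys)
```

On the training data the error falls as more stumps are added. A smaller learning rate generally needs more estimators to reach the same fit, which is why the two HPs are usually tuned together and checked against validation data rather than training error.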
2.4. K-Nearest Neighbor
The KNN algorithm is a simple and easily implementable supervised learning technique commonly employed for classification and regression purposes. For a predetermined k-value, the approach computes the distances between a query point and the training data and bases the prediction on the k closest training points [
45]. Owing to its simplicity and intuitive nature, KNN has been widely employed for regression and classification tasks across various fields. Furthermore, the algorithm is commonly considered suitable for low-dimensional data characterized by a small number of input variables [
34]. KNN achieves a low error rate when managing extensive datasets and determines the optimal closest neighbors of a point using a minimal number of attributes (low dimension)[
34]. The most crucial HP in KNN is the k-value, which denotes the number of nearest neighbors considered. If the k-value is too small, overfitting may occur, whereas an excessively large k-value can cause underfitting and prolong computation time. Furthermore, the performance of the KNN model is significantly influenced by the weight function (such as uniform or distance) and the distance metric employed for prediction [
46].
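A minimal KNN regression sketch makes the influence of the k-value and the weight function concrete. The data are invented, and the function name `knn_predict` and its `weights` argument are assumptions chosen for illustration; the Euclidean distance is used as the distance metric.

```python
import math

def knn_predict(train_x, train_y, query, k, weights="uniform"):
    """Predict a numeric target from the k nearest training points."""
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    if weights == "uniform":
        # Every neighbour counts equally.
        return sum(y for _, y in neighbours) / len(neighbours)
    # weights == "distance": closer neighbours receive larger weights.
    exact = [y for d, y in neighbours if d == 0.0]
    if exact:
        return sum(exact) / len(exact)  # query coincides with a training point
    weighted = [(1.0 / d, y) for d, y in neighbours]
    return sum(w * y for w, y in weighted) / sum(w for w, _ in weighted)

# Hypothetical one-feature training set with one distant point.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]
query = (2.2,)

pred_k1 = knn_predict(train_x, train_y, query, k=1)
pred_k4_uniform = knn_predict(train_x, train_y, query, k=4)
pred_k4_distance = knn_predict(train_x, train_y, query, k=4, weights="distance")
```

With k=1 the prediction copies the single closest point, while k=4 with uniform weights averages over the whole training set and is pulled toward the distant point. Distance weighting keeps all four neighbours but lets the nearby ones dominate.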
2.5. Linear Regression
Combining statistical and ML techniques, the LR model is a linear equation that maps specific input values to corresponding output values. The main objective of regression analysis is to train a model using existing data and then make predictions by mapping an input value to an output value [
47]. Although LR involves relatively straightforward interpretation and minimal computational expenses, it also tends to exhibit bias [
34]. Despite these shortcomings, LR remains appealing due to its simplicity in terms of algorithmic design and the ease of analyzing the results [
32]. Owing to these advantages, LR has been consistently used as an ML algorithm for constructing waste generation prediction models. To enhance the performance of the LR model, the regularization penalty terms, such as “L1” and “L2,” are the HPs that require tuning [
40,
48].
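The effect of the two penalty terms can be sketched for the simplest case of a single centered feature with no intercept, where both penalized solutions have a closed form. The objective scaling (sum of squared errors plus `alpha` times the penalty), the data, and the names `fit_slope` and `soft_threshold` are assumptions for illustration; library implementations may scale the penalty differently.

```python
def soft_threshold(value, amount):
    """Shrink value toward zero by amount, clipping at zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_slope(xs, ys, penalty=None, alpha=0.0):
    """Slope w minimising sum((y - w*x)**2) plus an optional penalty on w."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    if penalty == "l2":  # objective adds alpha * w**2
        return sxy / (sxx + alpha)
    if penalty == "l1":  # objective adds alpha * abs(w)
        return soft_threshold(sxy, alpha / 2) / sxx
    return sxy / sxx     # ordinary least squares

# Hypothetical centered data with a true slope close to 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

w_ols = fit_slope(xs, ys)
w_l2 = fit_slope(xs, ys, penalty="l2", alpha=10.0)
w_l1 = fit_slope(xs, ys, penalty="l1", alpha=10.0)
w_l1_strong = fit_slope(xs, ys, penalty="l1", alpha=50.0)
```

Both penalties shrink the coefficient relative to ordinary least squares. The L2 penalty shrinks it smoothly, whereas a sufficiently strong L1 penalty sets it exactly to zero, which is why L1 regularization also acts as a form of variable selection.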
2.6. Random Forest
Proposed by Breiman [
49], RF is a classical bagging-based ensemble technique. RF draws multiple bootstrap samples from the original dataset and grows a tree, also referred to as a weak learner, on each sample. The results from the individual trees are then aggregated into a strong learner: by majority vote in classification and by averaging the predictions of all trees in regression. Since it incorporates multiple decision trees, the RF technique delivers superior performance compared to single decision trees, reduces the risk of overfitting, and lessens the impact of outliers [
50]. RF is an ensemble model that uses bagging, with the initial parameter to consider being n_estimators [
44]. Subsequently, the performance of the model can be improved by tuning “max_features,” which represents the number of features considered when searching for the best split. Similar to GBM, performance can be improved in a DT-based ensemble model by tuning the split criteria and maximum depth [
40].
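The bagging procedure and the two HPs named above can be sketched as follows. To keep the example short, each tree is reduced to a single split (a stump), so sampling features once per tree is the same as sampling them per split; a full RF grows deeper trees. The data, the seed, and the function names (`fit_stump`, `fit_forest`, `predict_forest`) are illustrative assumptions.

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Best single split over the candidate features (a depth-1 tree)."""
    best = None
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # no split possible: predict the sample mean everywhere
        mean = sum(targets) / len(targets)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]

def fit_forest(rows, targets, n_estimators, max_features, seed=0):
    """Bagging: one stump per bootstrap sample, with a random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(n_features), max_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feats)
        )
    return forest

def predict_forest(forest, row):
    """Regression output: the average of the individual tree predictions."""
    preds = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return sum(preds) / len(preds)

# Hypothetical data: (floor area, number of floors) -> waste quantity.
rows = [(100, 1), (150, 3), (200, 2), (400, 1), (450, 3), (500, 2)]
targets = [10.0, 14.0, 21.0, 40.0, 46.0, 52.0]
forest = fit_forest(rows, targets, n_estimators=25, max_features=1)
pred_small = predict_forest(forest, (120, 2))
pred_large = predict_forest(forest, (480, 2))
```

Because each stump sees a different bootstrap sample and feature subset, the individual predictions differ, and averaging them smooths out the variation of any single tree.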
2.7. Support Vector Machine
An SVM is an effective ML algorithm that is commonly applied for classification, regression, and anomaly detection [
51]. SVM adopts the principle of structural risk minimization to address challenges such as a small number of data samples, non-linearity, high dimensionality, and local minima. SVM exhibits particularly outstanding performance in multi-class classification tasks. The basic idea is that input data are non-linearly mapped to a feature space, where an optimal linear decision function is determined. A kernel function performs this mapping implicitly by computing inner products in the feature space, so the mapped coordinates never need to be calculated explicitly [
52]. This approach offers effective flexibility and generalization abilities, which are particularly valuable for addressing non-linear problems. In particular, SVM has the benefit of circumventing overfitting [
53]. In an SVR model, the kernel type is a crucial hyperparameter that needs to be tuned first. In general, four types of kernel exist (i.e., radial basis function (RBF), linear, polynomial, and sigmoid kernels), and selecting the appropriate kernel type is essential for improving performance. Besides kernels, the regularization parameter (C), which governs the complexity of the model, and epsilon (ε), which defines the margin within which errors are ignored by the loss function, also influence the performance of an SVR model [
54].
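The three SVR hyperparameters can be illustrated with their defining formulas. The sketch below writes out an RBF kernel, the epsilon-insensitive loss, and the objective of a linear SVR; it does not solve the optimization problem. The function names, the `gamma` parameter, and all numeric values are assumptions for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: similarity that decays with squared Euclidean distance."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def svr_objective(w, b, xs, ys, C, epsilon):
    """Linear SVR objective: model complexity term plus C times the total loss."""
    complexity = 0.5 * sum(wi * wi for wi in w)
    loss = sum(
        epsilon_insensitive_loss(y, sum(wi * xi for wi, xi in zip(w, x)) + b, epsilon)
        for x, y in zip(xs, ys)
    )
    return complexity + C * loss

# Hypothetical values.
same_point = rbf_kernel((1.0, 2.0), (1.0, 2.0), gamma=0.5)
far_point = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5)
inside_tube = epsilon_insensitive_loss(10.0, 10.3, epsilon=0.5)
outside_tube = epsilon_insensitive_loss(10.0, 12.0, epsilon=0.5)
objective = svr_objective([2.0], 0.0, [[1.0], [2.0]], [2.2, 5.0], C=10.0, epsilon=0.5)
```

A larger C puts more weight on the errors that fall outside the epsilon margin and so permits a more complex fit, while a larger epsilon ignores more of the small errors.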
5. Conclusions
Accurate information on the volume of DW generated is essential for achieving a sustainable and effective circular economy in the construction sector. In this respect, accurate data allow for the development of appropriate plans for managing the volume of recycling and landfill waste generated during demolition. Using data on structural characteristics and demolition equipment, this study created three and two models to predict recycling and landfill waste generation, respectively, based on the classification of DW properties. Various ML algorithms (including ANN, DT, GBM, KNN, LR, RF, and SVR) were tested to develop a prediction model with optimal performance. The findings indicated that the model utilizing the RF algorithm exhibited the highest performance. The average R-squared values for the training, validation, and test datasets were 0.993, 0.951, and 0.951, respectively, thus affirming its exceptional performance. In the validation and test results, the “recyclable mineral waste generation” model achieved an accuracy of 0.987, while the “recyclable combustible waste generation” model attained 0.972 accuracy. For the “recyclable metals generation” model, accuracy reached 0.953 or above, and the “landfill specified waste generation” model achieved 0.858 or higher accuracy. Lastly, the “landfill mix waste generation” model exhibited an accuracy of 0.984 or higher. The SHAP analysis established that floor area emerges as the most crucial input variable across the four models devised in this study: those for recyclable mineral waste, recyclable combustible waste, recyclable metals, and landfill mix waste. Furthermore, the type of equipment utilized in the demolition process was also revealed to be an important input variable for the generation of recycling and landfill wastes.
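For reference, the R-squared values reported above compare the squared prediction errors with the variance of the observed values. The following sketch uses the standard definition with invented numbers; the function name `r_squared` and the data are illustrative and do not reproduce the results of this study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted waste quantities (arbitrary units).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(observed, predicted)
```

A value of 1 indicates perfect agreement, and a value of 0 corresponds to a model that always predicts the mean of the observed values.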
This study has several noteworthy implications for both academia and industry. In scholarly terms, this study suggested employing various ML algorithms to estimate the quantity of recycling or landfill waste generated from structural demolition. This approach will provide valuable insights for resource acquisition and waste handling, thereby contributing to the pursuit of sustainable and resource-efficient urban planning and construction practices. Practically, the results can aid local government officials or demolition companies in making informed decisions about resource allocation, optimizing workforce and infrastructure, maximizing recycled waste, and minimizing landfill waste.
A limitation of this study is the insufficiency of data for certain types of information. For instance, the landfill specified waste generation model showed lower prediction performance than the other four models, since insufficient data can degrade the predictive capability of models. To address this limitation, data collection should be expanded through a wider range of comprehensive case studies. The developed models can then be enhanced and validated, ultimately offering detailed insights and information on WM. Continuously updating and retraining the ML algorithms employed to predict recycling and landfill waste generation volumes can further improve their reliability and accuracy.