3.1. Exhaust Emissions Results from Road Tests
Figure 4 presents the THC, NOx, CO, and CO2 emission levels recorded during the tests. Notably, for THC and NOx, significantly higher emission values are observed for the cold engine, making the distinction between cold and hot emission data crucial for accurate modeling. CO emissions for the cold engine also contribute more to total emissions, whereas CO2 emissions remain at the same level as for the heated engine. The increase in THC, NOx, and CO emissions from the cold engine is closely related to the unheated exhaust emission control system of the vehicle. When the engine has not yet reached its optimal operating temperature, the combustion process is less efficient, leading to higher levels of hydrocarbons (THC) and nitrogen oxides (NOx) in the exhaust. An unheated catalyst and the other components of the emission control system are not yet operating at full efficiency, increasing the emission of harmful substances [32,33]. For carbon monoxide (CO), the higher values for the cold engine likewise indicate inefficient combustion, typical of a cold engine, where the fuel-air mixture is not burned optimally. On the other hand, CO2 emissions, which are primarily correlated with total fuel consumption, do not show such pronounced differences between the cold and heated engines. Although fuel consumption is slightly higher for the cold engine, as it requires more fuel to reach the optimal operating temperature, the resulting differences in CO2 emissions are less noticeable. CO2 is a direct indicator of the total amount of fuel burned, so these differences are smaller than for the other pollutants, which are more sensitive to the technical condition and operating temperature of the engine. Therefore, analysis of CO2 emissions may not fully reflect the impact of engine state on exhaust emissions, whereas THC, NOx, and CO are more sensitive to changes in combustion efficiency and the operation of emission control systems.
The data set of emission inputs presented in Figure 4 was used to train the model. These data were divided into “cold” and “hot” emissions. Additionally, the data were further divided based on the clusters created for the explanatory variables, namely speed (V) and acceleration (a).
3.2. Clustering of Model Learning Inputs
To analyze emissions under different engine operating conditions, the Spectral Clustering algorithm was employed. This method groups the data representing emission levels on the basis of the explanatory variables, speed (V) and acceleration (a). It was chosen for its ability to capture complex nonlinear patterns in the data that are difficult to identify using traditional clustering methods, such as k-means. To determine the optimal number of clusters, the elbow method and silhouette coefficient analysis were used, calculated for various numbers of clusters (ranging from 2 to 9), which allowed the evaluation of cluster cohesion (Figure 5).
Figure 5 illustrates the silhouette score for different numbers of clusters (from 2 to 9) used in the Spectral Clustering. The silhouette score is a measure of cluster quality, assessing how well points are grouped within a cluster and how distinct they are from points in other clusters [34]. The horizontal axis of the chart represents the number of clusters, while the vertical axis shows the silhouette score. For cold emissions, the highest silhouette scores are achieved with 4 and 5 clusters, with a maximum of approximately 0.67 for 4 clusters, suggesting that 4 clusters provide the most effective partition. The score decreases significantly at six clusters and remains lower for larger numbers of clusters, indicating that further partitioning produces a less distinct and effective division of the data. Thus, the optimal number of clusters for the cold-emission data is 4, as it produces the best cluster quality. For warm emissions, however, the best solution is two clusters.
In the study, Spectral Clustering was applied to analyze emission data collected from vehicle engines. Data were initially divided into two sets according to engine temperature: cold and warm. Data for the cold engine state included the first 500 records, while the remaining records were assigned to the warm engine state.
Spectral clustering was performed on each subset, with the optimal number of clusters determined on the basis of preliminary analysis. For cold engine data, four groups were selected, while two groups were chosen for warm engine data (Figure 6). The clustering process utilized an affinity matrix based on the nearest neighbors to capture the internal structure of the data.
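The split-and-cluster procedure can be sketched as below, assuming a record array with speed and acceleration columns; the synthetic data and array sizes are illustrative, with only the 500-record cold/warm split and the cluster counts taken from the text.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)
# Synthetic stand-in: columns are speed V and acceleration a
data = np.column_stack([rng.uniform(0, 90, 1200), rng.normal(0, 1, 1200)])

# First 500 records: cold engine state; the remainder: warm engine state
cold, warm = data[:500], data[500:]

# Spectral clustering with a nearest-neighbors affinity matrix,
# using the cluster counts chosen in the preliminary analysis
cold_labels = SpectralClustering(
    n_clusters=4, affinity="nearest_neighbors", random_state=0
).fit_predict(cold)
warm_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0
).fit_predict(warm)
```

Scattering each subset in the (V, a) plane colored by its labels yields plots of the kind shown in Figure 6.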
The results shown in Figure 6 are illustrated using scatter plots, where each data point is colored according to its cluster assignment. These scatter plots demonstrate how the data are grouped on the basis of the explanatory variables for the future emission models, speed (V) and acceleration (a), with the differences between clusters clearly marked. Such visualizations provide insight into clustering patterns and the relationships between emission characteristics under different engine operating conditions.
3.3. Emission Modeling and Validation
For the data clusters created, emission models were developed for CO2, CO, THC, and NOx for both cold and hot engine states. The modeling process employed various regression techniques to assess their effectiveness under different engine conditions. The simulation was carried out separately for the data from cold and hot engines, allowing a detailed analysis of the impact of the engine state on the prediction results.
The first step was to define the target variables and the regression models to be evaluated. The models were evaluated on their ability to predict emissions from selected features, namely vehicle speed and acceleration. Different regression methods were considered to achieve a comprehensive understanding of the relationships between variables and to accurately forecast emissions for both cold and hot engines. Linear regression, as a fundamental model, served as a starting point for evaluating linear relationships between vehicle features and emissions [35]. Although linear regression is simple and effective, it may not suffice for more complex, nonlinear relationships. Therefore, polynomial regression was employed to model nonlinear dependencies by adding polynomial terms, potentially better capturing intricate interactions between variables. Additionally, Lasso and Ridge regression were introduced as regularization methods that address overfitting by penalizing large model coefficients. Regularization is crucial for feature selection and for improving model generalization, particularly with high-dimensional data. Decision tree regression, in turn, introduces a hierarchical approach that handles nonlinear relationships well but may be prone to overfitting if not properly pruned [36].
To further enhance prediction accuracy, random forest regression was used, which combines predictions from multiple decision trees, improving model stability and accuracy. Moreover, Support Vector Machines (SVM) regression was included as an advanced method that utilizes kernel functions to map input features to higher dimensions, potentially capturing more complex, non-linear relationships. The testing of various regression methods allowed the comparison of their effectiveness and the selection of the most appropriate model for the prediction of emissions, considering the specificity and complexity of the data.
For regression models such as polynomial regression, the process involved creating new features through a polynomial transformation and then training a linear regression model on the transformed data. The results were evaluated using the mean squared error (MSE) and the coefficient of determination (R2), which indicate prediction accuracy and model fit. For each target (THC, NOx, CO, CO2) and each regression model, MSE and R2 were calculated for both cold and hot engine data. In the case of polynomial regression, the process included feature transformation, model training, and performance evaluation based on predictions and actual values. For the remaining models, the build_and_evaluate_model function was used to automatically build and assess the model, returning the relevant metrics.
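The evaluation loop can be illustrated with a plausible reconstruction of the build_and_evaluate_model helper named above. The study does not show its implementation, so the signature, the train/test split ratio, and the synthetic data below are all assumptions; only the metrics (MSE, R2) and the polynomial-transform-plus-linear-fit structure come from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def build_and_evaluate_model(model, X, y, test_size=0.2, seed=0):
    """Fit a regressor on a train split and report MSE and R2 on the test split.

    Hypothetical reconstruction of the helper named in the text.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return mean_squared_error(y_te, pred), r2_score(y_te, pred)

# Example: polynomial regression as a pipeline (feature transform + linear fit)
rng = np.random.default_rng(0)
V = rng.uniform(0, 90, 300)          # synthetic speed values
a = rng.normal(0, 1, 300)            # synthetic acceleration values
X = np.column_stack([V, a])
y = 0.01 * V**2 + 2.0 * a + rng.normal(0, 1, 300)   # invented emission proxy

poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
mse, r2 = build_and_evaluate_model(poly, X, y)
```

The same helper can be called with any of the other regressors (Lasso, Ridge, decision tree, random forest, SVM) in place of the pipeline, which is what makes a uniform comparison across models straightforward.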
The results of the best prediction methods for each emission component and engine state are presented in Table 1.
Table 1 presents the evaluation results of various regression models for predicting the emissions of four chemical compounds, THC, NOx, CO, and CO2, for both cold and hot engine states. For each compound, the model with the best performance is indicated, along with quality metrics: MSE (mean squared error) and R2 (coefficient of determination).
For THC emissions in a cold engine, the best model was random forest regression, achieving a low MSE of 0.00002 and a relatively high R2 of 0.74408, indicating good prediction quality. For NOx in the same engine state, polynomial regression performed best, with an MSE of 0.00006 and an R2 of 0.59200, suggesting a moderate model fit. Gradient boosting was the best model for CO emissions, though its MSE was 0.00291 and its R2 relatively low at 0.47986, indicating that the model may not be ideal for predicting this emission. For CO2 emissions in a cold engine, polynomial regression was the best model, achieving a low MSE (0.00321) and a very high R2 of 0.92200, indicating an excellent fit.
In the case of a hot engine, Gradient Boosting was the best model for THC, with a minimal MSE of 0.00001 and an R2 of 0.65674, reflecting good prediction quality. For NOx, polynomial regression was again the best model, but with lower MSE (0.00001) and R2 of 0.41565, indicating moderate fit. CO emissions in a hot engine were best predicted by polynomial regression, although its MSE was 0.00277 and R2 was relatively low at 0.21246, suggesting lower model quality compared to other emissions. Lastly, for CO2, polynomial regression was also the best model, with an MSE of 0.00221 and a very high R2 of 0.95100, indicating very good fit of the model.
These results demonstrate that different regression models exhibit varying effectiveness in predicting different types of emissions, and the choice of the best model depends on the specific chemical compound and the state of the engine. Model validation was also performed through visual interpretation of residual plots, histograms, real vs. predicted plots, and QQ plots. An example validation plot for the prediction of THC for a hot engine is presented in Figure 7.
Figure 7 presents example validation plots for THC prediction in a hot engine. The residual plot displays the differences between the actual and predicted values of the model as points. The horizontal axis represents the predicted values, while the vertical axis shows the residuals, which are the differences between actual and predicted values. Ideally, residuals should be randomly distributed around a horizontal line at zero, indicating a good fit of the model. For the THC component in Figure 7, the residual plot shows that most of the data points are clustered around the horizontal line.
The histogram of the residuals illustrates the distribution of the residuals in bar form. The horizontal axis shows the residual values, and the vertical axis represents the number of observations within each range of residual values. Ideally, the histogram should resemble a normal distribution, suggesting that the residuals are randomly distributed and the model fits the data well. The histogram of residuals presented shows a distribution similar to the normal distribution.
The real vs. predicted values plot shows how well the model predicts actual values. The horizontal axis represents the actual values, while the vertical axis represents the predicted values. Ideally, all points should be close to the diagonal line representing perfect fit (where predicted values equal actual values). The dispersion of points around this line indicates the model’s accuracy—smaller deviations suggest better model fit. For the prediction of THC, the real vs. predicted plot also indicates a good fit of the model.
The Q-Q (quantile-quantile) plot of residuals is used to assess whether the residuals follow a normal distribution [37]. The horizontal axis shows the theoretical quantiles of a normal distribution, while the vertical axis shows the empirical quantiles of the residuals. If the points on the plot align along a straight line, the residuals are well approximated by a normal distribution, suggesting that the model is appropriate. Most of the THC prediction data points lie along the straight line.