The following sections analyze the results obtained by the different machine learning methods for each of the analyzed assumptions.
3.1. ML models using input variables Type 1
Table 2 shows the adjustments obtained for the selected machine learning models to predict log Kd, developed with the same variable combination used by Li et al. (2020) [15].
Table 2.
Adjustments for the different machine learning models developed using the input variables selection Type 1. Random forest (RF), support vector machine (SVM), and artificial neural network (ANN). T, V, and Q are training, validation, and query phases, respectively. Root mean square error (RMSE), mean absolute percentage error (MAPE), and correlation coefficient (r). The best models (regarding RMSE for the validation phase) are in bold.
| Group | Model | T: RMSE | T: MAPE (%) | T: r | V: RMSE | V: MAPE (%) | V: r | Q: RMSE | Q: MAPE (%) | Q: r |
|---|---|---|---|---|---|---|---|---|---|---|
| PE/seawater | RF | 0.525 | 18.67 | 0.983 | 0.380 | 7.48 | 0.988 | 0.523 | 13.38 | 0.979 |
| PE/seawater | SVM | 0.287 | 2.83 | 0.993 | 0.248 | 4.61 | 0.993 | 0.357 | 13.24 | 0.990 |
| PE/seawater | ANN | 0.257 | 3.13 | 0.994 | 0.236 | 4.42 | 0.994 | 0.561 | 23.33 | 0.979 |
| PE/freshwater | RF | 0.549 | 8.08 | 0.973 | 0.744 | 13.67 | 0.944 | 0.565 | 7.23 | 0.963 |
| PE/freshwater | SVM | 0.536 | 8.93 | 0.976 | 0.770 | 11.14 | 0.945 | 0.475 | 10.46 | 0.978 |
| PE/freshwater | ANN | 0.489 | 6.79 | 0.978 | 0.865 | 13.20 | 0.932 | 0.464 | 8.59 | 0.974 |
| PE/pure water - 1 | RF | 0.471 | 11.28 | 0.968 | 0.176 | 3.31 | 0.992 | 0.531 | 9.48 | 0.929 |
| PE/pure water - 1 | SVM | 0.356 | 5.93 | 0.974 | 0.132 | 2.06 | 0.993 | 0.411 | 6.90 | 0.958 |
| PE/pure water - 1 | ANN | 0.309 | 4.92 | 0.981 | 0.225 | 3.92 | 0.982 | 0.729 | 12.21 | 0.937 |
| PE/pure water - 2 | RF | 0.410 | 7.79 | 0.967 | 0.132 | 2.25 | 0.993 | 0.526 | 8.59 | 0.936 |
| PE/pure water - 2 | SVM | 0.466 | 9.51 | 0.955 | 0.205 | 3.47 | 0.983 | 0.439 | 8.10 | 0.953 |
| PE/pure water - 2 | ANN | 0.409 | 6.45 | 0.965 | 0.231 | 4.23 | 0.981 | 0.431 | 7.72 | 0.955 |
| PP/seawater | RF | 0.255 | 9.95 | 0.990 | 0.199 | 6.69 | 0.994 | 0.298 | 4.97 | 0.968 |
| PP/seawater | SVM | 0.260 | 5.12 | 0.989 | 0.244 | 6.92 | 0.988 | 0.779 | 7.32 | 0.817 |
| PP/seawater | ANN | 0.160 | 3.19 | 0.996 | 0.270 | 8.94 | 0.988 | 0.307 | 4.21 | 0.956 |
| PS/seawater | RF | 0.221 | 5.28 | 0.996 | 0.794 | 14.61 | 0.883 | 1.003 | 15.11 | 0.820 |
| PS/seawater | SVM | 0.554 | 23.10 | 0.969 | 0.524 | 21.69 | 0.965 | 0.436 | 12.85 | 0.988 |
| PS/seawater | ANN | 0.337 | 9.21 | 0.988 | 0.643 | 15.69 | 0.972 | 0.773 | 15.07 | 0.956 |
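The three statistics reported in Table 2 are not written out as formulas in this section. The following is a minimal sketch assuming the standard definitions of RMSE, MAPE (in percent), and the Pearson correlation coefficient; the function names and the sample log Kd values are hypothetical and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    std_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (std_o * std_p)


# Hypothetical log Kd values, for illustration only (not taken from the study).
observed = [2.0, 3.0, 4.0, 5.0]
predicted = [2.2, 2.9, 4.1, 4.8]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"MAPE = {mape(observed, predicted):.2f}%")
print(f"r    = {pearson_r(observed, predicted):.3f}")
```

Because MAPE divides by the observed value, it grows quickly for cases whose log Kd is close to zero, which is one reason a model can combine a moderate RMSE with a high MAPE.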
The first models (PE/seawater) are the ML models that predict the adsorption capacity for polyethylene in seawater. The three selected models are ranked by their RMSE value in the validation phase: the artificial neural network (ANNlog) model gives the best adjustment (0.236), followed by the support vector machine (SVMn2 log) model (0.248) and, finally, the random forest model (0.380). The three models present very high correlation coefficients for the validation phase, equal to or greater than 0.988; in addition, the mean absolute percentage error remains low, between 4.42% and 7.48%.
The good adjustments shown in the validation phase can also be observed in the training phase, where the RMSE values remain similar to those of the validation phase, except for the random forest model, whose RMSE grows to 0.525 (MAPE of 18.67%). In the query phase, the model that provided the best results in the validation and training phases, the ANN model, presents the worst results in terms of RMSE and MAPE (0.561 and 23.33%, respectively), despite maintaining a high correlation coefficient (0.979). The other two models, the support vector machine and the random forest, present slightly higher RMSE values than in the validation phase (0.357 and 0.523, respectively).
Given these results (Table 2), it can be said that the three models show good performance, although the errors increase slightly for the query phase. Despite this, the errors, in terms of RMSE, remain below the test error reported by Li et al. (2020) (0.752) for the model developed with these three input variables (log D, εα and εβ).
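The selection and comparison logic described for PE/seawater can be sketched as follows. The RMSE values are transcribed from Table 2 and the baseline is the test error of Li et al. (2020) quoted in the text; the dictionary layout and variable names are illustrative assumptions, not the authors' code.

```python
# RMSE per phase for the PE/seawater models (values transcribed from Table 2).
# T = training, V = validation, Q = query.
pe_seawater_rmse = {
    "RF": {"T": 0.525, "V": 0.380, "Q": 0.523},
    "SVM": {"T": 0.287, "V": 0.248, "Q": 0.357},
    "ANN": {"T": 0.257, "V": 0.236, "Q": 0.561},
}

# Test-phase RMSE reported by Li et al. (2020) for the three-variable model.
LI_TEST_RMSE = 0.752

# Rank the models by validation RMSE (lowest first), as done in the text.
ranked_by_validation = sorted(pe_seawater_rmse, key=lambda m: pe_seawater_rmse[m]["V"])

# Check which models stay below the baseline in the query phase.
beats_baseline = {m: v["Q"] < LI_TEST_RMSE for m, v in pe_seawater_rmse.items()}

print("Validation ranking:", ranked_by_validation)
print("Query RMSE below Li et al. (2020):", beats_baseline)
```

Selecting on the validation phase and reporting on the query phase keeps the query cases out of the model choice, which is why the validation ranking and the query ranking can differ.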
The second group of models (PE/freshwater) corresponds to machine learning models that predict the adsorption capacity for polyethylene in freshwater. For the validation phase, the RMSE values are closer to each other than in the previous block. The worst model is the artificial neural network (ANNlin) model, with an RMSE of 0.865, followed by the support vector machine (SVMn log) model with 0.770; the best model is the random forest, with an RMSE of 0.744. The mean absolute percentage errors exceed those obtained by the ML models of the first block, varying between 11.14% and 13.67%.
For the training phase, the adjustments improve significantly with respect to the validation phase, with RMSE values between 0.489 and 0.549. For the query phase, the root mean square errors remain at acceptable levels, with mean absolute percentage errors between 7.23% and 10.46%. The best model for the validation phase (RF with RMSE of 0.744) presents the worst results for the query phase (RMSE of 0.565) and vice versa; the best model of the query phase (ANN with RMSE of 0.464) is the worst model in the validation phase (RMSE of 0.865).
Despite these behaviors, the three selected models have suitable adjustments for all phases (Table 2). If these models are compared with the model developed by Li et al. (2020) with this input variable (log D), it can be seen that all of them improve the adjustments in terms of the RMSE value in the test phase (0.661 vs. 0.464, 0.475, and 0.565).
The following two groups (PE/pure water - 1 and PE/pure water - 2) correspond to the machine learning models developed to determine the adsorption capacity for polyethylene in pure water. Two blocks have been developed because Li et al. (2020) present two different approaches, one using two input variables (PE/pure water - 1, with log D and M'w) and the other using only one input variable (PE/pure water - 2, with log D).
In our research, the case of 17α-ethinyl estradiol was not considered for the models developed with two input variables (PE/pure water - 1) because the authors did not report the experimental log Kd value, so these models lack this case. As expected, the models offer different results depending on the input variables. When two input variables are used, the model with the best results for the validation phase is the support vector machine (SVMn log) model, whereas with only one input variable the best model is the random forest. Using two input variables improves the adjustments in the training and validation phases (except for the RF model). For the query phase, the adjustments remain practically unchanged between the two selections, except for the ANN (ANNlin) model, whose RMSE is 0.729 with two input variables and 0.431 with one. The models developed with two input variables present low mean absolute percentage errors for the validation phase, between 2.06% and 3.92%. This behavior worsens slightly for the training phase, with 4.92% and 5.93% for the ANN and SVM models, respectively, and 11.28% for the RF model. In the query phase, the MAPE values are between 6.90% and 12.21%. Despite the increase in both the RMSE and the MAPE values, the models developed with two variables seem to behave adequately to predict log Kd.
The models developed to predict the adsorption capacity for polyethylene in pure water with one input variable (PE/pure water - 2) present, in general, slightly worse adjustments than those obtained for PE/pure water - 1. The best model, considering the root mean square error in the validation phase, is the random forest model, with an RMSE of 0.132. In the query phase, its RMSE increases to 0.526. The other two models, the SVM (SVMn2 log) model and the ANN (ANNlin) model, present RMSE values of 0.439 and 0.431, respectively, for this phase, slightly improving on the results of the RF model.
According to these results (Table 2), it can be said that the SVM and ANN models for PE/pure water - 2 show good performance in terms of RMSE and improve on the test-phase RMSE value (0.471) of the model developed by Li et al. (2020) using only one input variable (log D).
Before continuing, it is necessary to emphasize that the machine learning models developed to predict the adsorption capacity for PE in the different water samples generally present adequate mean absolute percentage errors for the query phase, below 10%. In some cases, the value is slightly higher (SVM for PE/freshwater and ANN for PE/pure water - 1), and in others the difference is more significant, as for the models intended to predict log Kd in seawater, which present errors between 13.24% and 23.33%.
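The query-phase MAPE values referred to above can be checked directly against Table 2. A minimal sketch, with the values transcribed from the table; the dictionary layout and the threshold constant name are illustrative assumptions.

```python
# Query-phase MAPE (%) for the PE models (values transcribed from Table 2).
query_mape = {
    ("PE/seawater", "RF"): 13.38,
    ("PE/seawater", "SVM"): 13.24,
    ("PE/seawater", "ANN"): 23.33,
    ("PE/freshwater", "RF"): 7.23,
    ("PE/freshwater", "SVM"): 10.46,
    ("PE/freshwater", "ANN"): 8.59,
    ("PE/pure water - 1", "RF"): 9.48,
    ("PE/pure water - 1", "SVM"): 6.90,
    ("PE/pure water - 1", "ANN"): 12.21,
    ("PE/pure water - 2", "RF"): 8.59,
    ("PE/pure water - 2", "SVM"): 8.10,
    ("PE/pure water - 2", "ANN"): 7.72,
}

MAPE_THRESHOLD = 10.0  # percent; the reference level used in the text

# Split the models into those below and those at or above the threshold.
below = {k: v for k, v in query_mape.items() if v < MAPE_THRESHOLD}
above = {k: v for k, v in query_mape.items() if v >= MAPE_THRESHOLD}

print(f"{len(below)} of {len(query_mape)} models are below {MAPE_THRESHOLD}%")
for (group, model), value in sorted(above.items(), key=lambda kv: kv[1]):
    print(f"  above threshold: {group} / {model}: {value}%")
```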
The following models (PP/seawater) are the models developed to predict the adsorption capacity of polypropylene in seawater. Based on the RMSE values for the validation phase, the best model is the random forest model (0.199), followed by the SVM (SVMlog) model (0.244) and, finally, the artificial neural network (ANNlin) model (0.270). The other statistical parameters of the validation phase show favorable behavior, with MAPE values below 9% and correlation coefficients above 0.980. For the training phase, the adjustments are similar to those of the validation phase, although an increase in the MAPE value of the random forest model is observed; even so, it remains below 10%.
For the query phase, the behavior is inconsistent. For the RF model and the ANN model, the statistics remain close to the values of the training and validation phases, while the RMSE of the SVM model increases to 0.779 and its correlation coefficient falls to 0.817.
Given these results (Table 2), it can be said that the RF and ANN models can perform prediction tasks correctly. These two models present lower RMSE values (0.298 and 0.307) than the model proposed by Li et al. (2020) in the test phase (0.369), which was developed with two input variables (log D and εβ). The SVM model presents high generalization errors, which implies that it should not be used for prediction tasks. It should be noted that this SVM model, which has the lowest validation error among all the SVM models developed, is the one with the highest error for the query phase. Other SVM models with similar RMSE values in the validation phase (0.255 and 0.262) subsequently showed better results in the query phase (0.287 and 0.266, respectively).
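The generalization problem noted for the PP/seawater SVM model can be made explicit by comparing the query RMSE with the validation RMSE of each model. The following sketch uses the values from Table 2; the "gap" measure (query RMSE minus validation RMSE) is an illustrative choice made here, not a statistic used by the authors.

```python
# Validation (V) and query (Q) RMSE for the PP/seawater models (from Table 2).
pp_seawater_rmse = {
    "RF": {"V": 0.199, "Q": 0.298},
    "SVM": {"V": 0.244, "Q": 0.779},
    "ANN": {"V": 0.270, "Q": 0.307},
}

# Illustrative generalization gap: how much the error grows on unseen query cases.
gap = {m: round(v["Q"] - v["V"], 3) for m, v in pp_seawater_rmse.items()}

# The model with the largest gap is the least reliable outside its fitting data.
worst_generalizer = max(gap, key=gap.get)

for model, value in gap.items():
    print(f"{model}: query RMSE exceeds validation RMSE by {value:.3f}")
print("Largest gap:", worst_generalizer)
```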
Finally, the last group of models (PS/seawater) corresponds to the machine learning models aimed at predicting the adsorption capacity for polystyrene in seawater. Based on the results shown in Table 2, and taking into account the RMSE value for the validation phase, the model with the best behavior in this phase is the support vector machine (SVMn2 log) model (0.524), followed by the artificial neural network (ANNlin) model (0.643) and the random forest model (0.794). Based on the mean absolute percentage error, the models intended to predict the adsorption for PS in seawater present the worst adjustments for the validation phase, varying between 14.61% and 21.69%. Despite this, the correlation coefficients remain high, with values greater than 0.960, except for the random forest model, whose correlation coefficient falls to 0.883. For the query phase, the RMSE values remain close to those of the validation phase, except for the random forest model (1.003), with MAPE values between 12.85% and 15.11%.
Taking into account the results shown in Table 2, it can be concluded that the models to predict the adsorption capacity for PS in seawater do not present, in general, good results, except for the SVM model, which improves on the test-phase RMSE value (0.714) of the model developed by Li et al. (2020) with two input variables (log D and π).
Taking into account the results obtained by the machine learning models that used the same variables as Li et al. (2020), it can be said that, in general, the ML models improve on the results obtained by Li et al. (2020). However, these types of ML models often need a large number of experimental cases and input variables to correlate the desired variable. Therefore, in this research, in addition to the ML models developed with the variables used by Li et al. (2020), other ML models have been developed with more input variables. This is possible because Li et al. (2020) report eight different input variables; the results obtained by the models with the input variables selection Type 2 are shown below (Table 3).
3.2. ML models using input variables Type 2
Table 3 shows the adjustments obtained for the machine learning models developed with the input variables combination Type 2, which uses all the available input variables (except in the cases in which the variable qH+ cannot be used).
The first models (PE/seawater) are the ML models that predict the adsorption for polyethylene in seawater. Unlike the Type 1 models for PE/seawater, where three input variables (log D, εα and εβ) were used, these new PE/seawater models use seven input variables (log D, M'w, εα, εβ, q-, V', π). Based on the RMSE value for the validation phase (Table 3), the best machine learning model is the SVM (SVMn log) model, with a value of 0.243, followed by the ANN (ANNlin) model (0.306); the random forest model has the highest RMSE value for this phase (0.373). For this phase, the three selected models present suitable adjustments. In addition, these models also present high correlation coefficients, all greater than 0.990. These promising results are also obtained for the training phase, although the random forest model presents a substantial increase in RMSE (from 0.373 to 0.824).
For the query phase, the RMSE values obtained by the models increase, as happened for the models with the input variables selection Type 1. In addition, comparing the query-phase data of Table 2 and Table 3, it can be seen that incorporating four additional variables with respect to the input variables selection Type 1 destabilizes the models' predictions, causing an increase in the RMSE value for this phase in all of them.
Despite this, the random forest and support vector machine models improve on the results of the three-variable model proposed by Li et al. (2020) (0.693 and 0.443 vs. 0.752, respectively, in terms of RMSE values for the test phase). The artificial neural network model developed with seven input variables presents an RMSE value close to that of the Li et al. (2020) model for the query phase (0.762 vs. 0.752). Only the SVM model developed using the input variables selection Type 2 (0.443) improves on some of the ML models that used the input variables selection Type 1 (RF, 0.523, and ANN, 0.561), although it does not improve on the Type 1 SVM model (0.357).
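The comparison between the two input variable selections for PE/seawater can be reproduced from the query-phase RMSE values quoted in the text and in Table 2. A minimal sketch; the variable names are illustrative assumptions.

```python
# Query-phase RMSE for PE/seawater under both input variable selections.
type1_query = {"RF": 0.523, "SVM": 0.357, "ANN": 0.561}  # three input variables
type2_query = {"RF": 0.693, "SVM": 0.443, "ANN": 0.762}  # seven input variables

# Test-phase RMSE of the three-variable model of Li et al. (2020).
LI_TEST_RMSE = 0.752

# Change in query RMSE per model when moving from Type 1 to Type 2.
delta = {m: round(type2_query[m] - type1_query[m], 3) for m in type1_query}

# Type 2 models that still stay below the Li et al. (2020) baseline.
type2_below_baseline = [m for m, v in type2_query.items() if v < LI_TEST_RMSE]

for model, change in delta.items():
    print(f"{model}: query RMSE changes by {change:+.3f} with seven variables")
print("Type 2 models below the baseline:", type2_below_baseline)
```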
The second group of models (PE/freshwater) corresponds to machine learning models aimed at predicting the adsorption capacity of polyethylene in freshwater using eight input variables (log D, M'w, εα, εβ, qH+, q-, V', π). In this case, the best model, based on the RMSE value for the validation phase, is the ANN (ANNlog) model (0.446), followed by the SVM (SVMn) model (0.473) and the RF model (0.697). These reasonable adjustments are reflected in the high correlation coefficients, all greater than 0.960. All statistical parameters improve for the training phase, except for the mean absolute percentage error of the random forest model. For the query phase, these new models present RMSE values between 0.210 and 0.392 and maintain high correlation coefficients, all higher than 0.980. Compared with the previously developed models using the input variables selection Type 1, the ML models developed with eight variables improve on the models developed with only one variable; the improvement is appreciable in all the parameters except three MAPE values.
Given the results reported in Table 3, it can be concluded that the RF, SVM, and ANN models developed using eight input variables improve on the model developed by Li et al. (2020) (0.392, 0.210, and 0.272 vs. 0.661, respectively, in terms of RMSE values for the test phase).
The next group of models (PE/pure water) are the ML models that predict the adsorption for polyethylene in pure water. These models were developed using the eight input variables (log D, M'w, εα, εβ, qH+, q-, V', π) instead of the two or one used by Li et al. (2020), which were also used in the development of the previous ML models (Table 2). In this case, the optimization process carried out by the RF model involved the elimination of the variable V' in the trees of the forest.
It can be seen in Table 3 that the best model, according to the RMSE value for the validation phase, is the SVM (SVMlog) model, which presents a value of 0.154, followed by the RF model (0.204) and the ANN (ANNlog) model (0.403). As in the previous models developed using the input variables selection Type 2, the correlation coefficients are high, all greater than 0.930. This good behavior for the validation phase is also observed in the training phase, although a small increase in the errors made by the models can be seen. For the query phase, the different models present RMSE values between 0.433 and 0.551, with MAPE values around 10% and correlation coefficients greater than 0.920.
Comparing the Type 2 ML models with the previously developed Type 1 models, it can be said that, for the query phase, the random forest and support vector machine models present adjustments, in terms of RMSE, similar to those of the Type 1 models. Despite this, only the support vector machine model improves on the results of the best model proposed by Li et al. (2020) (0.433 vs. 0.471, respectively, in terms of RMSE values for the test phase).
The next models (PP/seawater) correspond to the models developed to predict the adsorption for polypropylene in seawater using seven input variables (Log D, M'w, εα, εβ, q-, V', π).
Based on the root mean square error in the validation phase, the best model is the support vector machine (SVMlog) model (0.229), followed by the random forest model (0.245) and, finally, the artificial neural network (ANNlin) model, which presents a higher error than the other two models (0.419). The correlation coefficients of the three models are greater than 0.975. This good behavior in the validation phase is also observed in the training phase for both the random forest model and the support vector machine model; however, it should be noted that the artificial neural network model presents an error of only 0.029 in the training phase. The three models present RMSE values for the query phase between 0.215 and 0.494, with the random forest model offering the lowest value (0.215), closely followed by the support vector machine model (0.240), which was the best model in the validation phase.
If the results obtained by the models developed using the input variables selection Type 2 are compared with those of Type 1, it can be said that the increase in the number of variables has led to a significant decrease in the RMSE values obtained in the query phase for the RF and SVM models. This is clearest in the support vector machine model, whose RMSE goes from 0.779 to 0.240.
Given the results reported in Table 3, it can be concluded that the RF and SVM models developed using seven input variables improve on the model developed by Li et al. (2020) with two variables (0.215 and 0.240 vs. 0.369, respectively, in terms of RMSE values for the test phase). In addition, these models also improve on the machine learning models developed using the input variables selection Type 1, except for the ANN model, which is slightly worse.
Finally, the last group of models (PS/seawater) corresponds to the ML models that predict the adsorption for polystyrene in seawater using seven input variables (log D, M'w, εα, εβ, q-, V', π). In these new models, a significant improvement can be seen in the adjustment parameters of the validation and query phases. For the validation phase, the RMSE values are between 0.290 and 0.475 for the SVM (SVMn2 log) model and the RF model, respectively, while in the Type 1 models the RMSE values were between 0.524 and 0.794. Similar behavior is observed for the query phase, with RMSE values between 0.385 and 0.873. As can be seen in Table 3, the best model on this occasion is the support vector machine model, which also offers the best adjustment parameters for the query phase (0.385).
Given the results, it can be said that the SVM and the ANN (ANNlog) models developed using seven input variables improve the model developed by Li et al. (2020) with two input variables (0.385 and 0.407 vs. 0.714, respectively, in terms of RMSE values for the test phase).
Figure 1 represents the experimental and predicted values of log Kd for the best machine learning models (according to RMSE in the validation phase) of each block shown in Table 3.
Figure 1.
Scatter plots for the experimental and predicted values of log Kd for the selected ML models developed using the input variables selection Type 2. The dashed line corresponds to the line with slope 1.
Each graph shows that the training, validation, and query cases fit the line of slope 1 well, although some deviation can be observed, for example in a query case for the PE/seawater model or the PE/pure water model. In general, it can be seen that all the best models consistently predict the log Kd values.
Given the results shown in Table 2 and Table 3, the following key points can be drawn about the different machine learning models developed.
Regardless of the input variables chosen, there is always some machine learning model that improves the adjustments of Li et al. (2020) (in terms of RMSE for the query phase).
Including additional variables to develop the ML models does not always improve on the variable selection carried out by Li et al. (2020). This is especially evident in the ML models intended to predict PE/seawater, where no model developed using the input variables selection Type 2 improves on its Type 1 counterpart.
In the authors' view, it would be appropriate to increase the number of experimental cases for each microplastic/water group used to develop the models. Presumably, this increase would help the models achieve better adjustments.
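The first key point can be checked against the query-phase RMSE values quoted in this section. The sketch below collects, for each group, the lowest query RMSE under each input variable selection and the corresponding test RMSE of Li et al. (2020); all numbers are taken from the text and Table 2, while the data layout is an illustrative assumption. For PE/pure water, the Type 1 entry uses the one-variable block (PE/pure water - 2), whose baseline is 0.471.

```python
# Per group: (lowest Type 1 query RMSE, lowest Type 2 query RMSE, Li et al. test RMSE).
summary = {
    "PE/seawater": (0.357, 0.443, 0.752),
    "PE/freshwater": (0.464, 0.210, 0.661),
    "PE/pure water": (0.431, 0.433, 0.471),
    "PP/seawater": (0.298, 0.215, 0.369),
    "PS/seawater": (0.436, 0.385, 0.714),
}

# Key point 1: some model beats the baseline under either selection.
always_some_improvement = all(
    type1 < li and type2 < li for type1, type2, li in summary.values()
)

# Key point 2: groups where adding variables did not lower the best query RMSE.
no_gain_from_type2 = [g for g, (type1, type2, _) in summary.items() if type2 >= type1]

print("Some model improves on Li et al. (2020) in every group:", always_some_improvement)
print("Groups where Type 2 does not beat the best Type 1 model:", no_gain_from_type2)
```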