2.1. Model Development
This section formally describes how a reverse engineering problem is solved with an MOO approach. State-of-the-art MOO methods [24], such as genetic algorithms, need to generate and evaluate new recipes in each step. Because a large number of steps is required, performing on-line kMC simulations in the loop is very demanding [9].
Instead, a recipe search space R is selected here, where each recipe r ∈ R consists of the reaction time t, the initial monomer concentration cm,0, and the initial initiator concentration cini,0. The corresponding monomer concentration cm(r) and molar mass distribution MMD(r) are then obtained for each r ∈ R via kMC simulation. In the MOO approach, the input is a target MMD, MMDtarget, and the output is a set of optimal candidate recipes R* = {r*1, r*2, …, r*n}. R* is a subset of the recipe search space, R* ⊂ R, such that MMD(r*) is close to MMDtarget as evaluated on the basis of the MSE, while the conversion is maximal and the reaction time is minimal.
The optimization variables are presented in Table 1. The lower and upper limits for the variables cm,0, cini,0, and t are defined by the simulated data. As listed in Table 2, for a specific recipe r the simulated values cm(r) and MMD(r) are used to calculate the values of three optimization objective functions: the objective for the reaction time, ft(r), the objective for the mean squared error (MSE) between MMDtarget and the predicted MMD(r), fMSE(r), and the objective for the monomer concentration, fcm(r). The latter turns the maximization of the monomer conversion fconv(r) = (cm,0 − cm(r)) / cm,0 into a minimization problem, in which 1 − fconv(r) is minimized (Table 2).
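The three objectives can be written down in a few lines. The following is a minimal sketch, not the implementation used in this work: it assumes that both MMDs are sampled on the same molar-mass grid and are passed as plain sequences of numbers, and the function names are hypothetical.

```python
from typing import Sequence


def f_mse(mmd_target: Sequence[float], mmd_pred: Sequence[float]) -> float:
    """Mean squared error between two MMDs given on the same grid (assumption)."""
    if len(mmd_target) != len(mmd_pred) or not mmd_target:
        raise ValueError("MMDs must be non-empty and share the same grid")
    return sum((a - b) ** 2 for a, b in zip(mmd_target, mmd_pred)) / len(mmd_target)


def f_conv(c_m0: float, c_m: float) -> float:
    """Monomer conversion (c_m0 - c_m) / c_m0, to be maximized."""
    return (c_m0 - c_m) / c_m0


def f_cm(c_m0: float, c_m: float) -> float:
    """Minimization form of the conversion objective: 1 - f_conv."""
    return 1.0 - f_conv(c_m0, c_m)


def f_t(t: float) -> float:
    """Reaction time objective; the recipe's own t is used directly."""
    return t


# Example: 75 % conversion corresponds to f_cm = 0.25.
print(f_cm(4.0, 1.0))
```

Since 1 − fconv(r) equals cm(r) / cm,0, minimizing fcm is the same as minimizing the remaining fraction of monomer.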
The final decision takes user preferences into account by assigning specific weights to the objectives using the weighted sum method [24, 25]. In this way, the multi-objective function is represented as a single objective. The values of this function are then calculated for each candidate recipe, and the recipes with the minimal values of the objective function are selected as the set of optimal solutions. A weight wi is assigned to each normalized objective function fi as follows:
$$ F(r) = \sum_{i} w_i \, f_i(r) \tag{1} $$
where i ∈ {MSE, cm, t}, r ∈ R, and R is the polymerization recipe space. For clarity of presentation, the weight wcm of the objective fcm is also referred to as the weight of the conversion objective. If required, the number of objectives can be increased, e.g., the conversion of the initiator can be added.
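As an illustration of the weighted sum method, the sketch below scores a few candidate recipes and returns the indices with the smallest scores. The text does not state how the objectives are normalized, so min-max scaling over the candidate set is assumed here; the function names and the example numbers are hypothetical.

```python
from typing import Dict, List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant objective maps to 0 (assumed convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_sum_scores(objectives: Dict[str, Sequence[float]],
                        weights: Dict[str, float]) -> List[float]:
    """Weighted sum of normalized objectives, one score per candidate recipe."""
    normalized = {name: min_max(objectives[name]) for name in weights}
    n = len(next(iter(normalized.values())))
    return [sum(weights[name] * normalized[name][j] for name in weights)
            for j in range(n)]


def best_recipes(objectives: Dict[str, Sequence[float]],
                 weights: Dict[str, float], k: int = 1) -> List[int]:
    """Indices of the k candidates with the smallest weighted sum."""
    scores = weighted_sum_scores(objectives, weights)
    return sorted(range(len(scores)), key=scores.__getitem__)[:k]


# Three hypothetical candidates; the MSE objective is weighted most strongly.
objectives = {"MSE": [0.1, 0.3, 0.2], "cm": [0.5, 0.2, 0.3], "t": [60, 120, 100]}
weights = {"MSE": 0.6, "cm": 0.2, "t": 0.2}
print(best_recipes(objectives, weights, k=2))
```

Normalizing before weighting keeps the reaction time, which is numerically much larger than the MSE, from dominating the sum regardless of the chosen weights.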
The steps of the proposed algorithm are presented in Figure 2 as the direct approach. First, a search space RS ⊂ R is selected (for details see Section 2.2), and MMD(r) and cm(r) are obtained via kMC simulations for each r ∈ RS. Then, MOO is performed over RS. The objective function values are calculated as follows: the values of the objective ft(r) are already included in r as t, the values of the objective fcm(r) are calculated in advance for all possible r according to Table 2, and the values of the objective fMSE are determined by MMDtarget. Based on the calculated objective function values, the Pareto front points Rpar ⊂ RS leading to MMDtarget are identified. For this, the points from RS are represented in the Pareto optimal space, whose coordinates are given by the three objectives. The Pareto front consists of those points for which the value of one objective function cannot be improved without worsening the value of another objective function. Finally, the weights of the objective functions are defined, and a set of the best recipe candidates R* ⊂ Rpar is selected from the Pareto front points Rpar according to Eq. 1.
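The identification of the Pareto front points can be sketched with a simple pairwise dominance check. This is an illustrative brute-force version with quadratic cost in the number of recipes, not necessarily the procedure used in this work; all objectives are assumed to be minimized and the function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the points that are not dominated by any other point."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]


# Each tuple holds (f_MSE, f_cm, f_t) for one hypothetical recipe.
points = [(0.1, 0.5, 60.0), (0.3, 0.2, 120.0), (0.2, 0.6, 100.0)]
print(pareto_front(points))
```

In this example the third recipe is worse than the first in all three objectives and is therefore dropped, while the first two remain because each is better than the other in at least one objective.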
The algorithm described above is improved with respect to the optimization time by clustering the search space, which is illustrated in Figure 2 as the clustering-supported approach. Clustering divides the search space RS into a number of clusters, as illustrated in Figure 3. First, the search space RS is clustered on the basis of MMD(r), which allows for selecting a cluster Rtarget ⊂ RS containing the MMDs that are closest to MMDtarget. In general, a larger number of clusters leads to a smaller number of MMDs per cluster and thus to a higher similarity of the distributions within a cluster. However, this leaves less room for optimizing the other objectives, such as polymerization time and monomer conversion. For this reason, an appropriate trade-off between the number of clusters and their size has to be identified. After appropriate clustering, the cluster Rtarget for the target MMD is found (Figure 2, red arrows). Then, MOO reduces the search space to the Pareto front points Rpar ⊂ Rtarget. Finally, after the objective weights are defined, the best recipe candidates R* ⊂ Rpar are found according to Eq. 1. Since MOO is applied to a single cluster Rtarget ⊂ RS, which is considerably smaller than RS, the optimization time is significantly reduced.
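The selection of the target cluster can be sketched as follows. The sketch assumes that the search space has already been clustered, that each cluster is represented by a centroid, and that Rtarget is the cluster whose centroid has the smallest Euclidean distance to MMDtarget; this nearest-centroid rule and the function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_target_cluster(mmd_target: Sequence[float],
                          centroids: Sequence[Sequence[float]],
                          labels: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the label of the cluster nearest to the target and its member indices."""
    target_label = min(range(len(centroids)),
                       key=lambda c: squared_distance(mmd_target, centroids[c]))
    members = [i for i, lab in enumerate(labels) if lab == target_label]
    return target_label, members


# Two hypothetical clusters; recipes 1 and 2 belong to the cluster near the target.
centroids = [[0.0, 0.0], [10.0, 10.0]]
labels = [0, 1, 1, 0]
print(select_target_cluster([9.0, 8.0], centroids, labels))
```

Only the returned member indices then need to be passed to the Pareto front search and the weighted sum selection, which is where the reduction in optimization time comes from.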
Different methods can be applied for clustering the search space on the basis of MMD. Clustering of distributions and of their histogram representations is an important topic that has attracted considerable attention, because specific metrics should be used to compare distributions. One of the most popular and fastest clustering methods is the kMeans method [26]. A modified kMeans clustering algorithm was applied to the clustering of histograms [27]. Further, a novel non-parametric clustering algorithm for empirical probability distributions was proposed [28]. Here, the classical kMeans clustering method was used. This algorithm starts with a random separation of the MMDs into clusters. At each step, it recalculates the centroid of each cluster and reassigns the data points to the nearest new centroid. The clustering process finishes when the clusters are stable or the given number of iterations is reached. In this study, simple Euclidean distances are used to calculate the distances between multi-dimensional data points, although specific metrics for clustering of distributions are also available [27, 29].
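The kMeans procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the random initial separation is realized as a balanced random partition, an empty cluster keeps its previous centroid, and the function name kmeans with its parameters max_iter and seed is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(data: Sequence[Sequence[float]], k: int, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Cluster data into k groups; returns (labels, centroids)."""
    if not 0 < k <= len(data) or max_iter < 1:
        raise ValueError("need 0 < k <= len(data) and max_iter >= 1")
    # Random initial separation: shuffle the points and deal them out in turn,
    # so that every cluster starts non-empty (assumed variant).
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    labels = [0] * len(data)
    for rank, index in enumerate(order):
        labels[index] = rank % k
    centroids: List[List[float]] = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Recalculate the centroid of each cluster.
        for c in range(k):
            members = [data[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
        # Reassign every point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in data]
        if new_labels == labels:  # clusters are stable
            break
        labels = new_labels
    return labels, centroids


# Six hypothetical three-bin MMDs forming two clearly separated groups.
mmds = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]
labels, centroids = kmeans(mmds, k=2)
print(labels)
```

Which of the two groups receives label 0 depends on the random initial partition, so only the grouping itself, not the label values, should be relied upon.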
There are different strategies for generating the data for the MOO procedure: exclusive use of data generated in advance by kMC simulation (Figure 2, blue arrows), kMC data generated on demand, ML-generated data, hybrid data sets combining kMC-based and ML-generated data, etc. Currently, as a first step, the focus is exclusively on the use of kMC-simulated data.
2.2. Data Acquisition and Processing
The in-house developed kMC simulator mcPolymer was used to carry out the simulations required for generating the polymerization data for the search space [30]. The simulator allows for exporting the concentration profiles of all reactants and products as well as microstructural data such as MMD, chain composition, and branching of all polymeric species involved in the process. The simulator output was adapted to be machine-readable. The data were filtered, further abstracted, logically connected, and stored in the NoSQL database MongoDB. The kMC-simulated MMDs and monomer concentrations were obtained for the selected search space RS, allowing for MOO and the subsequent identification of the weighted optimal solution.
The kMC simulations were performed for radical polymerizations with VAc as monomer, tert-butyl peroxypivalate as initiator, and methanol as solvent. The simulations are based on a full kinetic model for the VAc radical polymerization containing all elementary reactions [31]. The following polymerization conditions were used: a constant temperature of 60 °C, cini,0 in the range of 1.0 to 20.0 mmol·L−1, and cm,0 in the range of 2.0 to 5.0 mol·L−1, with evenly spaced grid points for cini,0 (geometric scale) and cm,0 (arithmetic scale), resulting in 225 simulations of the process. The geometric scale was selected for cini,0 to place more emphasis on the small values of this parameter. The polymerization process was simulated for a constant reaction time of 6 hours, and the properties of interest were recorded every 20 minutes, giving 18 data points at different times for each investigated property. The data set thus contains 225 × 18 = 4050 different MMDs.
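The construction of the recipe grid can be sketched as follows. The text states only the total of 225 simulations, so a grid of 15 × 15 concentration values is an assumption made here for illustration, as is the choice of recording times from 20 to 360 minutes (excluding t = 0); the variable and function names are hypothetical.

```python
from itertools import product
from typing import List


def geometric_grid(lo: float, hi: float, n: int) -> List[float]:
    """n points from lo to hi with a constant ratio between neighbours."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio ** i for i in range(n)]


def arithmetic_grid(lo: float, hi: float, n: int) -> List[float]:
    """n points from lo to hi with a constant difference between neighbours."""
    step = (hi - lo) / (n - 1)
    return [lo + step * i for i in range(n)]


N_PER_AXIS = 15  # assumption: 15 x 15 = 225 simulations

c_ini0 = geometric_grid(1.0, 20.0, N_PER_AXIS)   # initiator, mmol/L
c_m0 = arithmetic_grid(2.0, 5.0, N_PER_AXIS)     # monomer, mol/L
times_min = list(range(20, 361, 20))             # 18 recording times in minutes

# One recipe r = (t, c_m0, c_ini0) per simulation and recording time.
recipes = [(t, cm, ci) for cm, ci, t in product(c_m0, c_ini0, times_min)]
print(len(c_m0) * len(c_ini0), len(times_min), len(recipes))
```

On the geometric scale, about half of the initiator grid points lie below roughly 4.5 mmol·L−1, which is how the small concentrations receive more attention than on an arithmetic scale.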
For training the ML prediction models for single-objective reverse engineering, the obtained data set was divided into a training set and a test set in a proportion of 80:20. The same data were used for MOO, again taking 80 % of the data as the search space RS and 20 % of the data as the test set Rtest. The test set Rtest contains 810 recipes, which correspond to a set MMDtest of 810 kMC-simulated MMDs. All optimization approaches were evaluated with MMDtest: in each test, a single MMD is selected from MMDtest and used as MMDtarget for the MOO approach, and the performance of the MOO is evaluated over every MMD in MMDtest.
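The 80:20 division can be sketched with a seeded shuffle of the recipe indices. Whether the split in this work was random is not stated, so the random assignment, the seed, and the function name are assumptions for illustration only.

```python
import random
from typing import List, Tuple


def split_indices(n: int, test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split range(n) into (search_space, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# 4050 simulated MMDs: 3240 form the search space, 810 serve as targets.
search_space, test_set = split_indices(4050)
print(len(search_space), len(test_set))
```

Each index in test_set then serves once as MMDtarget, while the search for candidate recipes is restricted to the indices in search_space.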