In this section, we present an experimental study using two real industrial datasets: one related to predicting product quality in the plastic injection process and another related to predicting machine component failures. We selected these datasets because they originate from real-world industry cases.
3.2. Quality Prediction in Plastic Injection
This experimental study concerns quality prediction in the plastic injection molding process. The data were collected from sensors on the injection machine during the production of plastic road lenses [
92]. The dataset comprises 1451 records described by 13 process parameters: the temperature of the melted material, mold temperature, filling time, plasticizing time, cycle time, closing force, peak closing force value, peak torque value, average torque value, peak back pressure value, peak injection pressure value, screw position at the end of holding pressure, and injection volume. In addition to these parameters, the dataset contains an attribute indicating the quality of the lenses. The authors of [
92] have defined four quality classes based on the standard "UNI EN 13201-2:2016" for lenses in motorized roads. According to this standard, the general uniformity
of the lens should be greater than 0.4 [
92]. Samples with a uniformity of less than 0.4 are categorized as "Waste" (Class 1) and should be discarded for not meeting the standard. Those with a uniformity between 0.4 and 0.45 are labeled as "Acceptable" (Class 2), meeting the standard but falling short of the company’s higher quality target. "Target" (Class 3) includes samples with a uniformity between 0.45 and 0.5, which is considered optimal. Samples with a uniformity greater than 0.5 are labeled as "Inefficient" (Class 4) and should be avoided, as producing lenses with such uniformity exceeds the standard requirements. The class distribution in the dataset is illustrated in
Figure 1, which shows that the dataset is quite balanced.
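For illustration, this class assignment amounts to a simple thresholding rule on the measured uniformity. The sketch below assumes a pandas DataFrame with a hypothetical uniformity column and applies the thresholds defined above; it is an illustrative sketch, not code from [92].

```python
import pandas as pd

def label_quality(uniformity: float) -> int:
    """Map the measured general uniformity to the four quality classes described above."""
    if uniformity < 0.4:
        return 1  # Waste: below the standard, must be discarded
    elif uniformity <= 0.45:
        return 2  # Acceptable: meets the standard, below the company target
    elif uniformity <= 0.5:
        return 3  # Target: optimal quality
    else:
        return 4  # Inefficient: exceeds the standard requirements

# Hypothetical usage on a DataFrame holding the measured uniformity of each lens:
# df["quality_class"] = df["uniformity"].apply(label_quality)
```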
We conducted a performance comparison of various ML and DL models, selecting models commonly employed in quality control applications for numerical data: SVM, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Extra Trees, XGBoost, KNN, Naive Bayes, and MLP. Time series models such as recurrent neural networks were not applicable in this study because the dataset contains no timestamp information.
We divided the dataset into a training set, a validation set, and a test set, and the size of each subset is presented in
Table 1.
Figure 2 illustrates the class distribution within each subset of the data.
The validation set is used to control overfitting during the training of the MLP model. To ensure a fair comparison, the classical machine learning models were trained exclusively on the training set.
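As a rough illustration of this partitioning, the sketch below performs a stratified two-step split with scikit-learn; the split fractions and variable names (X, y) are assumptions, and the actual subset sizes are those reported in Table 1.

```python
from sklearn.model_selection import train_test_split

# X holds the 13 process parameters, y the quality class (illustrative names).
# The fractions below are placeholders; the actual subset sizes are in Table 1.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Classical ML models are fitted on (X_train, y_train) only;
# (X_val, y_val) is used to monitor overfitting of the MLP.
```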
The results of the performance evaluation for various models on the test set are presented in
Table 2. Each model was fine-tuned by adjusting its hyperparameters manually. The results show that ensemble models achieve good performance in terms of precision, recall, and F1-score. Random Forest achieves the best scores on all performance metrics and is also the top-performing model identified by the authors of [92]. Following Random Forest, MLP reaches a value of 0.95 for all three metrics, followed by XGBoost, Gradient Boosting, and Extra Trees. The Random Forest was configured with 100 decision trees. The architecture of the MLP model is presented in
Table 3.
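To make the reported configuration concrete, the following sketch shows how the Random Forest (100 trees, as stated above) could be trained and scored with scikit-learn; the weighted averaging of the metrics is an assumption, since the averaging scheme is not specified.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

# 100 decision trees, as reported above; remaining hyperparameters tuned manually.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
# Weighted averaging over the four classes is an assumption of this sketch.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted")
```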
Models such as KNN and Decision Tree also achieve good F-scores of 0.91, while the remaining models, SVM, Naive Bayes, and Logistic Regression, show lower performance. A likely explanation is that these simpler models struggle to capture the complex relationships among the features in the dataset.
Figure 3 illustrates the confusion matrices of Random Forest and MLP. It shows that the two models effectively determine the quality of the parts, with only 7 misclassified parts for Random Forest and 15 for MLP. This result also suggests that a well-balanced dataset contributes to obtaining high-performing models.
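Confusion matrices such as those in Figure 3 can be produced along the following lines; this is a sketch assuming fitted rf and mlp classifiers and the test split from above, not the exact code used in the study.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ["Waste", "Acceptable", "Target", "Inefficient"]
for name, model in {"Random Forest": rf, "MLP": mlp}.items():
    cm = confusion_matrix(y_test, model.predict(X_test))
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot()
    plt.title(name)
plt.show()
```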
3.3. Machine Component Failure Prediction
The objective of this use case is to predict machine component failures based on real-time data collected from the machine and its operating conditions. This allows for anticipating failures before they occur and planning maintenance operations in advance.
The dataset used in this section, provided by Microsoft [
93], includes detailed information related to the operational conditions of 100 machines in an anonymous industrial process. The dataset was also used in the work of [
21], where the authors predicted failures of four machine components. The dataset consists of five distinct data sources [
21]:
Telemetry: includes measurements of machine pressure, vibration, rotation, and voltage.
Errors: log of recorded machine errors.
Machine: provides machine characteristics such as age and model.
Maintenance: contains the history of all machine component replacements.
Failures: information on the history of failed component replacements.
The initial dataset comprises 876,100 records of telemetry measured at hourly intervals. The date and time are rounded to the nearest hour for errors, maintenance, and failures data.
Since the data comes from five separate sources, it is necessary to combine them to create features that best describe the health condition of a machine at a given time. For this purpose, we applied the same feature engineering techniques used in [
21] to merge the different data sources. To achieve this, additional information was extracted from the initial data sources to enrich the dataset [
21]. The means and standard deviations of telemetry measurements over the previous 3 hours and 24 hours were computed to create a short-term and long-term history of telemetry data. This allows for better anticipation of failures and provides an early warning in case of a failure [
21]. The "errors" data source helped determine the number of errors of each type in the last 24 hours for each machine [
21]. From the "maintenance" data source, the number of days elapsed since the last component replacement was calculated [
21]. The "Failures" data source was used to create the label. Records within a 24-hour window preceding the replacement of a failed component were labeled with the corresponding component name (comp1, comp2, comp3, or comp4), while other records were labeled as "None" [
21]. A more detailed description of the applied feature engineering process can be found in [
21].
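To make these steps concrete, the sketch below reproduces the spirit of the telemetry part of this feature engineering with pandas. It assumes the telemetry source has been loaded into a DataFrame with the column names used in the Microsoft dataset (machineID, datetime, volt, rotate, pressure, vibration) and a parsed datetime column; it is a simplified illustration, not the exact implementation of [21].

```python
import pandas as pd

# telemetry: machineID, datetime (parsed), volt, rotate, pressure, vibration (hourly)
telemetry = telemetry.sort_values(["machineID", "datetime"])
signals = ["volt", "rotate", "pressure", "vibration"]

# Short-term (3 h) and long-term (24 h) telemetry history:
# rolling mean and standard deviation per machine.
feats = []
for window in (3, 24):
    rolled = (telemetry.set_index("datetime")
                       .groupby("machineID")[signals]
                       .rolling(f"{window}h"))
    mean = rolled.mean().add_suffix(f"_mean_{window}h")
    std = rolled.std().add_suffix(f"_sd_{window}h")
    feats.append(mean.join(std))
features = pd.concat(feats, axis=1).reset_index()

# Downsample to the 3-hour grid used for the final dataset (sketch only).
features = features[features["datetime"].dt.hour % 3 == 0]

# The remaining steps follow the same pattern (not shown here):
# - errors: one-hot encode the error type and apply a 24 h rolling sum per machine;
# - maintenance: forward-fill the last replacement date of each component and
#   compute the elapsed days at each timestamp;
# - failures: label records in the 24 h window before a component failure with
#   the component name (comp1..comp4), and all other records with "None".
```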
After the feature engineering, the final dataset comprises 290,642 records with 29 features, including machine information such as the machine identifier (machineID), age, and model; 3-hour and 24-hour rolling measurements of voltage, rotation rate, pressure, and vibration; error counts over the last 24 hours for the different error types; the time since the last replacement of each component; and a "datetime" column indicating the timestamp of each record, registered at regular three-hour intervals.
We compared multiple ML and DL models by evaluating them on the same test set. As the dataset includes a "datetime" column indicating the timestamp of each record, we also incorporated recurrent neural networks, which take a sequence of records as input. To achieve this, we selected a sequence length of eight records, covering the information of the past 24 hours. The class of the last record in the sequence is considered the sequence class. To avoid label overlaps, sequences containing a failure record were excluded when their last record was labeled "None". We also excluded these records, along with the first seven records of each machine, from the test set used to evaluate the models that take one record at a time. This ensures a fair comparison between models handling a single record and those handling a sequence of eight records.
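A sketch of how such sequences could be built is given below; the array names (X, y, machine_ids) are assumptions, and the exclusion rule follows the description above.

```python
import numpy as np

SEQ_LEN = 8  # 8 records x 3-hour interval = 24 hours of history

def build_sequences(X, y, machine_ids):
    """Build sequences per machine; the label of a sequence is the label of its last record."""
    seq_X, seq_y = [], []
    for m in np.unique(machine_ids):
        idx = np.where(machine_ids == m)[0]
        Xm, ym = X[idx], y[idx]
        for end in range(SEQ_LEN - 1, len(Xm)):
            window_labels = ym[end - SEQ_LEN + 1:end + 1]
            # Skip sequences whose last record is "None" but that contain a
            # failure record, to avoid label overlap as described above.
            if window_labels[-1] == "None" and np.any(window_labels[:-1] != "None"):
                continue
            seq_X.append(Xm[end - SEQ_LEN + 1:end + 1])
            seq_y.append(ym[end])
    return np.asarray(seq_X), np.asarray(seq_y)
```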
To develop models, we first divided the data into train, validation, and test sets. To achieve this, we followed the same partitioning as presented in [
21]. Records up to 31 August 2015 01:00:00 are used as the training set, those between 1 September 2015 01:00:00 and 31 October 2015 01:00:00 serve as the validation set, and those from 1 November 2015 01:00:00 onward compose the test set.
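This chronological split can be expressed directly on the "datetime" column; a minimal sketch assuming the engineered features are held in a DataFrame df, with boundary timestamps taken from the description above:

```python
# Chronological partitioning following [21]; boundaries as described above.
train = df[df["datetime"] <= "2015-08-31 01:00:00"]
val = df[(df["datetime"] >= "2015-09-01 01:00:00") &
         (df["datetime"] <= "2015-10-31 01:00:00")]
test = df[df["datetime"] >= "2015-11-01 01:00:00"]
```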
Table 4 presents the number of records in each subset for both the RNN models and the other models. The difference in record counts stems from the RNN models taking sequences of eight records as input.
Figure 4 and
Figure 5 depict the distribution of classes in the training, validation, and test subsets for the two cases. We can observe that the subsets have similar class distributions, which indicates a good data partitioning. The class distributions in these subsets are also similar to that of the initial dataset.
Table 5 presents the performance results of the different models used in our experiment. The hyperparameters of each model were fine-tuned manually. We can observe that SVM, the ensemble models (Random Forest, XGBoost, Gradient Boosting, Extra Trees), and the DL models (MLP, SimpleRNN, LSTM, GRU) all demonstrate high performance, with an F-score exceeding 0.95. XGBoost and GRU outperform the other models with an F-score of 0.98, followed by Random Forest with an F-score of 0.97. The architecture of the GRU model is presented in
Table 6. The XGBoost model is set up with 70 estimators and the Random Forest model is configured with 100 estimators.
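As an illustrative sketch, the configurations reported above could be set up as follows with Keras, XGBoost, and scikit-learn; the GRU hidden size and the number of input features are placeholders, since the actual architecture is given in Table 6.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

n_features = 25  # placeholder: number of input features per record (illustrative)

# GRU over sequences of 8 records (24 h of history); the hidden size is a
# placeholder, the actual architecture is reported in Table 6.
gru = Sequential([
    GRU(64, input_shape=(8, n_features)),
    Dense(5, activation="softmax"),  # 4 components + "None"
])
gru.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])

# Ensemble configurations as reported above.
xgb = XGBClassifier(n_estimators=70)
rf = RandomForestClassifier(n_estimators=100)
```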
The confusion matrices for these three models are shown in
Figure 6. We can observe that the primary sources of prediction error are the failure predictions for components 1 and 3. The results confirm that ensemble learning models can give good results for failure prediction on a numerical dataset, while recurrent neural networks can capture relevant information from time series data for failure prediction.