1. Introduction
Time series discretization transforms continuous values into discrete ones [1]. Symbolic discretization is one of the most widely used approaches for transforming time series due to, among other properties, its ability to exploit data richness and its lower bounding guarantees [2,3]. In turn, the Symbolic Aggregate Approximation (SAX) is the most widely used symbolic discretization method owing to its simple design, easy implementation, and low computational cost [2,3].
SAX employs the well-known Piecewise Aggregate Approximation (PAA). PAA reduces the dimensionality of the time series by averaging the values that fall within each of the equal-sized time intervals defined at the beginning of the SAX procedure. Each average value is then compared against a set of breakpoints, computed from the Gaussian distribution, to assign the corresponding symbol. Despite its advantages, SAX has been criticized for its Gaussian distribution assumption and for the inevitable information loss incurred when the dimensionality of the data is reduced [4,5,6,7].
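The two steps above can be sketched as follows; this is a minimal illustration of standard PAA and SAX, not the exact implementation used by any of the cited works, and the function names are ours:

```python
import numpy as np
from scipy.stats import norm

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: average over equal-sized intervals."""
    series = np.asarray(series, dtype=float)
    # Split into n_segments (nearly) equal chunks and average each one.
    return np.array([chunk.mean() for chunk in np.array_split(series, n_segments)])

def sax(series, n_segments, alphabet_size):
    """Map each PAA coefficient to a symbol via Gaussian breakpoints."""
    # Breakpoints divide N(0, 1) into alphabet_size equiprobable regions.
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    coeffs = paa(series, n_segments)
    # searchsorted returns the index of the region each coefficient falls in.
    indices = np.searchsorted(breakpoints, coeffs)
    return "".join(chr(ord("a") + i) for i in indices)

# Example: a z-normalized sine-like series.
ts = np.sin(np.linspace(0, 2 * np.pi, 64))
ts = (ts - ts.mean()) / ts.std()
print(sax(ts, n_segments=8, alphabet_size=4))
```

Note that the breakpoints are fixed in advance from the Gaussian assumption; the SAX variants discussed next differ mainly in how they choose or perturb these breakpoints.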
Building on the competitive SAX results reported in the literature, several variants have been implemented to address its drawbacks. For example, Extended SAX (ESAX) [3,8] complements the PAA average value of each segment with its minimum and maximum values, mapping each of the three to a different symbol. Furthermore, in adaptive SAX (aSAX) [9], the breakpoints computed from the Gaussian distribution are adjusted by the one-dimensional k-means clustering algorithm (Lloyd's algorithm). On the other hand, rSAX (Random Shifting-based SAX) [10] minimizes information loss by perturbing the breakpoint values with shifts generated from a uniform distribution. Similarly, 1D-SAX [11] tries to minimize information loss by considering trends (slopes) along with the average value. The symbolic representation of 1D-SAX is therefore obtained from two values, the slope and the mean, and the assigned character is produced by interleaving the binary representations of the slope and average symbols in each time interval.
Recently, He et al. [12] proposed two new symbolic time series representations based on the Transformable Interval Object (TIO) approach: the Hexadecimal Aggregate approXimation (HAX) and the Point Aggregate approXimation (PAX). TIO employs the maximum and minimum values of a time series segment and, based on these values, computes the corresponding segment angle. The PAX string is built from these TIO points, while HAX transforms the TIO points into a hexadecimal string. Furthermore, Kegel et al. [13] proposed two representation approaches, sSAX and tSAX, exploiting two features of time series: season and trend, respectively. Similarly, Bountrogiannis et al. [14] proposed two further representations that avoid the Gaussian assumption of SAX: pSAX and cSAX. pSAX (probabilistic SAX) accurately approximates the actual data distribution using the Epanechnikov kernel density estimator and the Lloyd-Max quantizer instead of assuming a Gaussian distribution, while cSAX (clustering SAX) discretizes the time series using the mean-shift clustering algorithm, automatically setting the number of breakpoints (the alphabet size).
All of the above methods use a local search to find the discretization scheme; other methods instead perform a global search for this task. For example, Acosta-Mesa et al. [15] found discretization schemes with a global search algorithm, Evolutionary Programming. Ahmed et al. [16,17] proposed the Harmony Search algorithm to find the best symbolic time series discretization scheme. Furthermore, Fuad et al. [18,19] implemented two well-known evolutionary algorithms, the Genetic Algorithm and Differential Evolution, to search for the optimal breakpoint values.
Another symbolic time series discretization approach is the enhanced Multi-objective Symbolic Discretization for Time Series (eMODiTS) [20], which enlarges the search space by assigning a unique set of breakpoints to each unequal-sized time interval. This method uses a multi-objective evolutionary algorithm to find suitable discretization schemes and obtains competitive results in the classification task. However, evaluating each objective function is computationally expensive, which represents an essential disadvantage of the method. Several strategies have therefore emerged to address this disadvantage; one of them is the implementation of surrogate models.
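The contrast with plain SAX can be illustrated by a discretizer that accepts a per-interval breakpoint scheme. The scheme below is a hypothetical hand-written example, not one produced by eMODiTS's evolutionary search, and the function name is ours:

```python
import numpy as np

def discretize_per_interval(series, scheme):
    """Discretize with a unique breakpoint set per (unequal) time interval.

    `scheme` is a list of (end_index, breakpoints) pairs: each word segment
    ends at end_index and carries its own sorted breakpoint list.
    """
    series = np.asarray(series, dtype=float)
    symbols, start = [], 0
    for end, breakpoints in scheme:
        mean = series[start:end].mean()               # PAA-style segment average
        region = np.searchsorted(breakpoints, mean)   # region the average falls in
        symbols.append(chr(ord("a") + region))
        start = end
    return "".join(symbols)

ts = np.array([0.1, 0.3, 0.2, 1.5, 1.7, 1.6, 1.4, -0.8, -0.9])
# Hypothetical scheme: three unequal segments, each with its own cuts.
scheme = [(3, [-0.5, 0.5]), (7, [0.0, 1.0, 2.0]), (9, [-1.0, 0.0])]
print(discretize_per_interval(ts, scheme))
```

Because both the interval boundaries and every breakpoint list are free variables, the search space is far larger than in SAX, which is what motivates the evolutionary search and, in turn, its cost.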
Surrogate models reduce the processing time of most complex optimization problems [21]. Although these models emerged in the 1960s, their rise in complex problems has occurred in the last ten years (Figure 1a). The areas where surrogate models are applied are diverse: according to the literature, engineering has the most applications, followed by Information and Computing Sciences, Artificial Intelligence, and Aerospace Engineering; see Figure 1b.
In engineering and computer science, surrogate models mainly concentrate on finding suitable designs for mechanical components, notably optimizing antenna designs [22,23,24,25], microwave components [25,26], and aerodynamic shapes [27,28], or even optimizing groundwater exploitation [29]. These works employ multi-objective evolutionary optimization with Kriging and Support Vector Regression (SVR), among others, as surrogate models.
Surrogate models are also employed for machine learning tasks. An analysis of their use in each machine learning task is presented in Figure 2. The figure shows that prediction is the data mining task in which an approximation model most often replaces the original model, whereas tasks such as classification account for fewer publications, leaving a niche of opportunity for researchers. Most surrogate model research applied to time series classification focuses on hyperparameter optimization [30], deep learning [31], and neuroevolution [32]. Nevertheless, as far as the state of the art has been reviewed, surrogate models have scarcely been implemented in time series discretization.
The work of Márquez et al. [33] is one of the few in which surrogate models were implemented to reduce the computational cost of identifying an appropriate discretization scheme for temporal data. The researchers applied surrogate models to the enhanced multi-objective symbolic discretization for time series (eMODiTS) [20], a temporal data mining technique that discretizes time series using a unique set of value cuts for each time cut and three objective functions. Before the objective functions can be evaluated for an individual, the data set must be discretized using the scheme that the individual represents; this process is computationally expensive, an essential disadvantage of the method. Surrogate models were therefore employed to approximate the values of the three objective functions. Because the size of an eMODiTS individual varies across instances, k-Nearest Neighbors (kNN) was employed as the surrogate model. Moreover, the Pareto front is evaluated with the original models every N generations and added to the training set to update the surrogate models. However, this results in infrequent surrogate updates, thereby reducing the model's fidelity.
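A surrogate of this kind can be sketched as kNN regression over previously evaluated individuals, with Dynamic Time Warping (DTW, which the conclusions note was paired with kNN) as the distance so that individuals of different lengths remain comparable. This is an illustrative reconstruction under those assumptions, not the authors' exact implementation:

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping distance between two sequences of any lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def knn_surrogate(candidate, archive, k=3):
    """Estimate objective values as the mean over the k DTW-nearest
    exactly evaluated individuals stored in the archive."""
    ranked = sorted(archive, key=lambda entry: dtw(candidate, entry[0]))
    return np.mean([objectives for _, objectives in ranked[:k]], axis=0)

# Archive of (individual, objective-vector) pairs with different lengths.
archive = [
    ([0.1, 0.5, 0.9], np.array([0.30, 0.10])),
    ([0.2, 0.4, 0.6, 0.8], np.array([0.25, 0.15])),
    ([0.0, 1.0], np.array([0.60, 0.40])),
]
print(knn_surrogate([0.1, 0.45, 0.85], archive, k=2))
```

The key property is that DTW imposes no fixed dimensionality on the inputs, which is exactly what variable-length eMODiTS individuals require.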
Consequently, our primary motivation for this research is to extend the findings of [33] by modifying the model update process. The proposed methodology evaluates, with the original objective functions, the Pareto front at each generation (individual-based strategy) and the current population at regular intervals (generation-based strategy). Additionally, each time the generation-based update is applied, the Pareto front (evaluated on the original functions) is incorporated into the training set.
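The combined update schedule can be summarized as the loop skeleton below, where `evaluate_exact`, `evolve`, and `pareto_front` are placeholders standing in for the corresponding eMODiTS components rather than real functions from the method:

```python
def run_surrogate_assisted(pop, surrogate, evaluate_exact, evolve,
                           pareto_front, n_generations, update_interval):
    """Skeleton of the proposed update schedule (illustrative placeholders).

    Individual-based: the Pareto front is re-evaluated with the original
    objective functions at every generation.
    Generation-based: every `update_interval` generations, the whole
    population is also evaluated exactly, and the exactly evaluated
    solutions are added to the surrogate's training set.
    """
    training_set = [(ind, evaluate_exact(ind)) for ind in pop]
    surrogate.fit(training_set)
    for gen in range(n_generations):
        pop = evolve(pop, surrogate)                 # fitness via the surrogate
        front = pareto_front(pop)
        exact_front = [(ind, evaluate_exact(ind)) for ind in front]   # every generation
        if gen % update_interval == 0:
            exact_pop = [(ind, evaluate_exact(ind)) for ind in pop]   # periodic
            training_set.extend(exact_front + exact_pop)
            surrogate.fit(training_set)              # refresh the surrogate
    return pareto_front(pop)
```

Evaluating the front exactly at every generation is what raises the number of original-function evaluations relative to [33], trading some of the saved cost for higher surrogate fidelity.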
Therefore, the objectives of this research are described below:
To increase the number of evaluations conducted on the original problem functions, thereby improving the fidelity of the surrogate models.
To maintain the accuracy of the classification task achieved by the original model (eMODiTS).
To compare the surrogate-assisted approach against SAX-based discretization methods, verifying whether incorporating surrogate models maintains, improves, or worsens performance with respect to these well-known discretization approaches.
The organization of this document is as follows. Section 2 describes the materials and methods used in this research and the implemented methodology. Section 3 presents the experiments performed to reach the objectives introduced in Section 1 and discusses the results. Finally, Section 4 describes the conclusions drawn from those results.
4. Conclusions
Surrogate models are an alternative tool for approximating objective functions in evolutionary optimization. This document implemented a surrogate model for estimating the objective functions of eMODiTS, extending the approach proposed by Márquez-Grajales et al. [33]. Since eMODiTS employs individuals of different sizes, the kNN algorithm with DTW was incorporated as the surrogate model. This surrogate-assisted eMODiTS was called sMODiTS, and its behavior was compared against the original model.
The results suggest that each implemented version of sMODiTS behaves similarly to the original approach (eMODiTS), with no statistically significant difference between them. However, eMODiTS still achieves a lower classification error rate than sMODiTS in most datasets. Moreover, regarding the prediction power of the surrogate model, the metrics suggest that sMODiTS estimates the values of the original eMODiTS fitness functions with low accuracy.
On the other hand, the Pareto fronts of both approaches were compared using MOEA performance measures to evaluate the behavior of the final solutions found by each approach. These measures indicate that the performance of the sMODiTS algorithm is competitive compared to the eMODiTS algorithm since the sMODiTS Pareto front is close to the eMODiTS Pareto front.
Regarding computational cost, sMODiTS performs fewer evaluations than eMODiTS, reducing the use of the original objective functions by between 15% and 80% and thus lowering the computational cost of the original algorithm.
Finally, the statistical tests indicate that sMODiTS achieves competitive results compared to SAX-based symbolic discretization methods: there is no statistical difference among all the compared methods, and sMODiTS ranks below seven of the ten approaches.
In summary, the surrogate models used in this study approximated the actual model outcomes while significantly reducing the number of computationally intensive objective function evaluations. They also preserved the effectiveness of the time series discretization task compared to methods that have demonstrated competitive performance on the tested problems. Furthermore, although the surrogate models' accuracy is low, they are suitable for problems whose solutions have different lengths, particularly the time series discretization proposed by the eMODiTS approach. Consequently, the objectives set out in Section 1 have been achieved, verifying that the surrogate model maintains the original model's results, with a competitive approximation to the eMODiTS solutions at a lower computational cost.
As future work, we propose implementing other surrogate models (a different one per objective function) capable of handling different-sized solutions, to increase sMODiTS' accuracy in estimating the fitness functions of the eMODiTS method. Moreover, suitable initial sampling methods could be incorporated to achieve a more reliable approximation of the original model. Finally, different training set codifications could be compared to evaluate whether this feature impacts fidelity to the original models.
Figure 1. Bibliographic analysis elaborated from https://www.dimensions.ai/. (a) Keywords used: surrogate model or approximation model. (b) Keywords used: surrogate model optimization or approximation model optimization.
Figure 2. Marimekko chart searched at https://www.dimensions.ai/ using the keywords surrogate model optimization OR approximation model optimization combined with clustering, classification, prediction, associative analysis, characterization, and feature selection.
Figure 3. Representations for the PAA and SAX algorithms. The final string is obtained by mapping each PAA coefficient to a symbol.
Figure 4. eMODiTS discretization approach, where each word segment contains its own breakpoint scheme; the resulting string for the example is shown in the figure.
Figure 5. eMODiTS's flowchart. Dotted and light gray boxes represent the NSGA-II stages that were adapted to the eMODiTS representation scheme.
Figure 6. Representation of an individual in eMODiTS.
Figure 7. Crossover operator based on the one-point approach. The dashed line represents the cut performed on each parent.
Figure 8. General scheme of sMODiTS. Green arrows indicate paths followed when the conditions are satisfied, red arrows indicate paths followed when they are not, and blue arrows represent the normal flow of the diagram.
Figure 9. Prediction power reached by (a) Márquez-Grajales et al. [33] and (b) sMODiTS. The color bars represent the absolute discrepancy between the two approaches regarding prediction error.
Figure 10. Statistical comparison results between eMODiTS and every version of sMODiTS. The Friedman test and Nemenyi post hoc were employed to perform this analysis with a 95% confidence level.
Figure 11. Statistical results of comparing eMODiTS and sMODiTS using the Texas Sharpshooter plot and the Wilcoxon rank-sum test with 95% confidence. Regions (A) and (C) are where the eMODiTS approach outperforms sMODiTS regarding the F-measure, while regions (B) and (D) are where sMODiTS outperforms eMODiTS. Regions (C) and (D) are zones where the Wilcoxon rank-sum test showed a significant difference; (A) and (B) are the opposite.
Figure 12. Percentage reduction in the number of evaluations reached by sMODiTS compared to eMODiTS.
Figure 13. Statistical comparison results between two versions of sMODiTS and ten SAX-based approaches, with eMODiTS as a reference. The Friedman test and Nemenyi post hoc were employed to perform this analysis with a 95% confidence level.
Table 1. Datasets used in this research, obtained from [75]. The 'Abbrev' column is the authors' suggested abbreviation for the database name.
Table 2. Parameter settings for eMODiTS and sMODiTS. The values were selected according to those reported in [20].
Parameter | Value
Population size | 100
Generation number | 300
Independent executions number | 15
Crossover rate | 80%
Mutation rate | 20%
Table 3. Average prediction metrics achieved by Márquez-Grajales et al. [33] and sMODiTS. Bold numbers represent the best values for each measure.
Table 4. Analysis of the Pareto fronts obtained by eMODiTS and each version of sMODiTS using the performance measures HVR, Generational Distance (GD), coverage of eMODiTS over sMODiTS, coverage of sMODiTS over eMODiTS, and convergence index (CI). The values displayed represent the average of each measure over all test databases. Values in bold indicate the maximum for each metric, while values in italics indicate the minimum.