The fundamental concept of Rocket methods is obtaining features from time series data and employing those features to train a classifier. These models use convolutional kernels to transform the time series data into features, which are then used for classification [67]. Given a time series $X = (x_1, x_2, \dots, x_n)$, these algorithms compute features such as the maximum value (Max) and the proportion of positive values (PPV) for each of the $k$ convolutional kernels. The convolutional operation for a kernel $w$ can be expressed as:

$$z_i = (X * w)_i = \sum_{j=0}^{m-1} x_{i+j}\, w_j,$$

where $*$ denotes the convolutional operation, $z$ is the output of the convolution, and $m$ is the kernel length. The Max and PPV features are computed as follows:

$$\mathrm{Max} = \max_{i} z_i, \qquad \mathrm{PPV} = \frac{1}{|z|} \sum_{i} \mathbb{1}(z_i > 0),$$

where $\mathbb{1}(\cdot)$ is the indicator function. The extracted features are then used to train a linear classifier for time series classification.
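For illustration, the following is a minimal NumPy sketch of this feature-extraction step using random kernels. It is a simplified stand-in rather than the actual Rocket implementation: the kernel lengths and weights are arbitrary choices here, and the dilation, padding, and bias terms used by Rocket and its variants are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def rocket_like_features(x, num_kernels=100):
    """Compute Max and PPV features of a 1-D series with random convolutional kernels.

    Simplified sketch: the real Rocket kernels also use dilation, padding, and bias.
    """
    features = np.empty(2 * num_kernels)
    for k in range(num_kernels):
        m = rng.choice([7, 9, 11])                  # kernel length m (illustrative values)
        w = rng.normal(0.0, 1.0, size=m)            # random kernel weights w
        z = np.convolve(x, w, mode="valid")         # z = X * w
        features[2 * k] = z.max()                   # Max feature
        features[2 * k + 1] = np.mean(z > 0)        # PPV feature
    return features

x = np.sin(np.linspace(0, 20, 500))                 # toy time series
print(rocket_like_features(x).shape)                # (200,) -> 2 features per kernel
```

The resulting feature vectors (one row per series) are what the linear classifier is trained on.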
4.1. Empirical Mode Decomposition
For time series decomposition, feature extraction, and noise reduction, empirical mode decomposition (EMD) methods are applied [
70]. The variations of the EMD include the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) [
71], empirical wavelet transform (EWT) [
72], and variational mode decomposition (VMD) [
73]. These methods are advanced signal processing techniques that aim to decompose a given time series into a finite set of components, each representing an intrinsic mode function (IMF) [
74] or oscillatory mode.
EMD is a data-driven method that decomposes non-linear time series into a set of IMFs [75]. The main idea behind EMD is the so-called sifting process, which iteratively extracts IMFs by identifying local extrema and fitting envelopes using cubic spline interpolation. Given a time series $x(t)$, the sifting process begins with the identification of all the local maxima and minima. Next, the upper and lower envelopes are created by interpolating the local maxima and minima using cubic spline interpolation. The mean of the envelopes is then calculated as follows:

$$m(t) = \frac{e_{\mathrm{up}}(t) + e_{\mathrm{low}}(t)}{2},$$

where $e_{\mathrm{up}}(t)$ is the upper envelope and $e_{\mathrm{low}}(t)$ is the lower envelope [76].
The difference between the original signal and the mean is considered a candidate IMF:

$$h(t) = x(t) - m(t),$$

and this process is repeated on the candidate IMF until it meets the predefined stopping criterion. The procedure is then applied to the residual signal until all IMFs are extracted.
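As an illustration of a single sifting step, the sketch below uses SciPy's extrema detection and cubic splines. It is deliberately bare-bones, with no boundary handling and no stopping criterion; full implementations (for example, the PyEMD package) treat these details more carefully.

```python
import numpy as np
from scipy.signal import argrelextrema
from scipy.interpolate import CubicSpline

def sift_once(x):
    """One sifting iteration: spline envelopes of the extrema, then subtract their mean."""
    t = np.arange(len(x))
    max_idx = argrelextrema(x, np.greater)[0]     # indices of local maxima
    min_idx = argrelextrema(x, np.less)[0]        # indices of local minima
    if len(max_idx) < 2 or len(min_idx) < 2:
        return None                               # too few extrema to build envelopes
    e_up = CubicSpline(max_idx, x[max_idx])(t)    # upper envelope e_up(t)
    e_low = CubicSpline(min_idx, x[min_idx])(t)   # lower envelope e_low(t)
    m = (e_up + e_low) / 2.0                      # mean envelope m(t)
    return x - m                                  # candidate IMF h(t) = x(t) - m(t)

t = np.linspace(0, 1, 1000)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
h = sift_once(x)                                  # first candidate IMF of a toy signal
print(h[:5])
```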
The EWT involves the decomposition of a given signal into oscillatory modes with varying scales and frequencies [77]. The EWT algorithm produces a collection of $n$ non-linear functions known as IMFs from the signal $x(t)$ and a mother wavelet function $\psi$. The process for generating these IMFs is outlined in Algorithm 1.
Algorithm 1: EWT
Once the set of IMFs is obtained, EWT applies a Fourier transform to each IMF to produce a set of $n$ spectrograms, which are utilized to visualize the time-frequency information of the signal [
78]. The EWT has the following expression:

$$W_x(i, t) = (x * \psi_i)(t),$$

where $\psi_i$ is the $i$-th filter, obtained as the convolution of the scaling function $\phi$ with the mother wavelet $\psi$ scaled by a factor $s_i$.
The EWT combines the concepts of EMD and the wavelet transform. The main idea of EWT is to decompose the signal into a set of oscillatory modes using an adaptive filter bank. The filter bank is designed based on the signal’s time-frequency content, estimated by the continuous wavelet transform [79]. The EWT decomposition is given by:

$$x(t) = \sum_{i=1}^{N} f_i(t) + r(t),$$

where $f_i(t)$ are the wavelet components, $N$ is the number of modes, and $r(t)$ is the residual.
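The adaptive filter-bank idea can be illustrated with a deliberately simplified sketch: the spectrum is partitioned with ideal (brick-wall) band-pass filters at manually chosen boundaries, whereas EWT proper estimates the boundaries from the spectrum and uses smooth Meyer-type filters. The function name and boundary value below are illustrative only.

```python
import numpy as np

def ideal_filter_bank_modes(x, boundaries):
    """Split a signal into modes with an ideal filter bank in the frequency domain.

    'boundaries' are normalized frequencies in (0, 0.5) that partition the spectrum.
    """
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x))                       # normalized frequencies in [0, 0.5]
    edges = np.concatenate(([0.0], boundaries, [0.5 + 1e-12]))
    modes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.where((freqs >= lo) & (freqs < hi), X, 0.0)
        modes.append(np.fft.irfft(band, n=len(x)))
    return np.array(modes)

t = np.linspace(0, 1, 1000, endpoint=False)               # fs = 1000 Hz
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
modes = ideal_filter_bank_modes(x, boundaries=[0.03])     # split at 30 Hz (0.03 * fs)
print(np.allclose(modes.sum(axis=0), x))                  # the modes sum back to x: True
```

By construction, the extracted modes sum back to the original signal, mirroring the decomposition identity above.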
VMD is another decomposition technique that formulates the extraction of IMFs as a constrained variational problem. VMD decomposes the time series into a set of band-limited IMFs by minimizing a cost function that balances the compactness of the frequency spectrum and the smoothness of the time-domain signal [80]. The VMD optimization problem can be written as:

$$\min_{\{u_k\},\{\omega_k\}} \; \sum_{k=1}^{K} \left\lVert \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\rVert_2^2 \quad \text{subject to} \quad \sum_{k=1}^{K} u_k(t) = x(t),$$

where $u_k$ are the mode functions, $K$ is the number of modes, and $\omega_k$ are the center frequencies of the modes.
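To give a feel for the objective, the NumPy/SciPy sketch below evaluates the bandwidth term of the VMD cost for a candidate mode and a candidate center frequency (expressed in Hz here). It is not a VMD solver; the function name and test signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def vmd_bandwidth_term(u_k, f_k, fs):
    """Squared L2 norm of the time derivative of the demodulated analytic signal of u_k,
    i.e. the bandwidth term of the VMD cost for a center frequency f_k (in Hz)."""
    t = np.arange(len(u_k)) / fs
    analytic = hilbert(u_k)                           # (delta(t) + j/(pi t)) * u_k(t)
    demodulated = analytic * np.exp(-1j * 2 * np.pi * f_k * t)
    grad = np.gradient(demodulated, 1.0 / fs)         # partial_t [...]
    return np.sum(np.abs(grad) ** 2) / fs

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
mode = np.sin(2 * np.pi * 50 * t)
# The term is smallest near the mode's true center frequency (50 Hz) and grows away from it.
print(vmd_bandwidth_term(mode, 50.0, fs) < vmd_bandwidth_term(mode, 120.0, fs))  # True
```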
4.2. Classification Methods
To evaluate the effectiveness of Rocket methods, including MiniRocket and MultiRocket, for classifying faults in insulators, a comprehensive analysis is conducted by combining these algorithms with various classifiers mentioned above. This experimental design aims to determine the best-suited combination of Rocket techniques and classification methods, ultimately enhancing insulator fault detection accuracy.
Logistic Regression: Logistic regression, a prevalent linear technique employed for classification, utilizes a logistic function to model the probability of a specific class or event [
81]. The following equation represents the logistic function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
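A brief scikit-learn usage sketch on synthetic data follows; the feature matrix and labels are illustrative stand-ins for the features extracted from the time series.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))                   # stand-in feature matrix (e.g., PPV/Max)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic binary labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3]).round(3))         # class probabilities from the logistic function
```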
Ridge Regression: Ridge regression, also known as Tikhonov regularization, is a linear regression technique incorporating an $\ell_2$ regularization term to address the issue of multicollinearity and improve the generalization of the model [82]. It is particularly useful when there are highly correlated features. The objective function for ridge regression can be written as follows:

$$\min_{w,\, b} \; \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2 + \lambda \lVert w \rVert_2^2,$$

where $w$ is the weight vector, $b$ is the bias term, $y_i$ and $x_i$ are the true label and the feature vector for the $i$-th instance, respectively, and $\lambda$ is the regularization parameter that controls the trade-off between model complexity and the goodness of fit. The regularization term, $\lambda \lVert w \rVert_2^2$, discourages the model from assigning large weights to the features, leading to a more stable and robust solution.
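With the bias term omitted for brevity, the closed-form solution implied by this objective can be written in a few lines of NumPy; the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))                    # feature vectors x_i
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)      # labels y_i with a little noise

lam = 1.0                                        # regularization parameter lambda
# Closed-form ridge estimate: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge.round(2))                          # close to w_true for small lambda
```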
Decision Tree: The decision tree classifier, a non-parametric, hierarchical model, recursively partitions the input space into discrete regions according to feature values. The decision rules are derived by minimizing the impurity of the resultant nodes, which can be quantified using metrics such as Gini impurity or entropy [
83].
The architecture of the classifier is built in the form of a tree structure where each internal node represents a feature or attribute, each branch represents a decision rule, and each leaf node represents a class label or a decision. According to Mishra et al. [
84], the architecture can be further improved by using clustering techniques.
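A minimal scikit-learn sketch on synthetic data is shown below; the criterion argument selects the impurity measure (Gini impurity or entropy), and the depth limit is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0).fit(X, y)
print(tree.get_depth(), tree.score(X, y))        # depth of the fitted tree, training accuracy
```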
k-NN: The k-nearest neighbors (k-NN) classifier, a non-parametric, instance-based learning algorithm, classifies novel instances based on the majority class of their $k$ nearest neighbors. The distance metric and the value of $k$ are crucial to the algorithm’s performance. Since it is a classification problem, employing an odd $k$ is more advantageous, thus avoiding ties [85]. For this task, the weighted mode is denoted by:

$$c = \arg\max_{c'} \sum_{i=1}^{k} w_i\, \delta(c', y_i),$$

where $\delta(c', y_i)$ returns 1 if $c' = y_i$ and 0 otherwise [86], $y_i$ is the class of the $i$-th example associated with the weight $w_i$, and $c$ is the class with the best weighted mode. To compute the neighbors, the Euclidean, cosine, correlation, Chebyshev, city block, Spearman, standardized Euclidean, Minkowski, and Mahalanobis distance metrics can be applied [87].
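A compact NumPy sketch of the weighted vote with Euclidean distances is given below; the inverse-distance weighting is one common choice and is assumed here rather than taken from the text.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=5):
    """Weighted k-NN: the class with the best weighted mode among the k nearest neighbors."""
    d = np.linalg.norm(X_train - x_query, axis=1)          # Euclidean distances
    nn = np.argsort(d)[:k]                                 # k nearest neighbors
    weights = 1.0 / (d[nn] + 1e-12)                        # inverse-distance weights w_i (assumed)
    classes = np.unique(y_train)
    scores = [(weights * (y_train[nn] == c)).sum() for c in classes]   # sum_i w_i * delta(c, y_i)
    return classes[int(np.argmax(scores))]

rng = np.random.default_rng(3)
X_train = rng.normal(size=(50, 2))
y_train = (X_train[:, 0] > 0).astype(int)
print(weighted_knn_predict(X_train, y_train, np.array([0.5, -0.2]), k=5))
```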
LDA: Linear discriminant analysis (LDA), a technique utilized for dimensionality reduction and classification, identifies the linear combination of features that optimally separates distinct classes by maximizing the dispersion between classes and minimizing the dispersion within a class [88]:

$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},$$

where $S_B$ and $S_W$ represent the between-class and within-class scatter matrices, respectively.
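The scatter matrices and the criterion $J(w)$ can be computed directly, as in the following sketch on synthetic two-class data (the class-size weighting of the between-class scatter is one standard convention).

```python
import numpy as np

def fisher_criterion(X, y, w):
    """Compute J(w) = (w^T S_B w) / (w^T S_W w)."""
    mu = X.mean(axis=0)
    S_B = np.zeros((X.shape[1], X.shape[1]))
    S_W = np.zeros_like(S_B)
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)    # between-class scatter S_B
        S_W += (Xc - mu_c).T @ (Xc - mu_c)                 # within-class scatter S_W
    return (w @ S_B @ w) / (w @ S_W @ w)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_criterion(X, y, np.array([1.0, 1.0])))        # large J(w) for a separating direction
```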
Gaussian Naive Bayes: Gaussian Naive Bayes is a classification algorithm that is based on Bayes’ theorem [89], assuming the features are conditionally independent and follow a Gaussian distribution:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

where $A$ and $B$ are events or variables. The Gaussian Naive Bayes assumes that the features in the dataset are normally distributed and that they are independent of each other.
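A short scikit-learn sketch on synthetic data illustrates the classifier; the per-class Gaussian likelihoods and Bayes’ rule are handled internally.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
gnb = GaussianNB().fit(X, y)                     # per-class Gaussian likelihoods + Bayes' rule
print(gnb.predict_proba(X[:3]).round(3))         # posterior probabilities P(class | features)
```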
SVM: The support vector machine (SVM) classifier endeavors to identify the optimal separating hyperplane between classes [90]. Its performance is governed by the kernel function and the regularization parameter $C$:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i \left( w^\top \phi(x_i) + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0,$$

where $\phi(\cdot)$ is the feature map induced by the kernel and $\xi_i$ are slack variables.
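A scikit-learn usage sketch follows; the kernel choice and the value of $C$ are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# The kernel and C govern the margin / misclassification trade-off.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(svm.score(X, y))                           # training accuracy
```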
Random Forest: The random forest classifier is an ensemble learning methodology that constructs multiple decision trees and amalgamates their outputs via majority voting [91]. The operator regulates the number of trees ($T$) and their maximum depth. Let $X$ be the set of input features and $Y$ be the set of output classes. The random forest classifier consists of $T$ decision trees, $h_1, h_2, \ldots, h_T$, each grown to a maximum depth. Each tree is created from a randomly sampled subset of the training data, typically with replacement (i.e., bootstrapped samples), and a random subset of input features at each split. The random forest classifier is given in the following definition:

$$\hat{y} = \arg\max_{c \in Y} \sum_{t=1}^{T} \delta\left( h_t(x), c \right),$$

where $\hat{y}$ represents the final classification, again $\delta(h_t(x), c)$ returns 1 if $h_t(x) = c$ and 0 otherwise, and $h_t(x)$ is the output of the $t$-th decision tree for input $x$.
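A scikit-learn usage sketch is given below; note that scikit-learn aggregates the trees by averaging their predicted class probabilities, a soft variant of the hard majority vote written above, and the parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# n_estimators plays the role of T; each tree is grown on a bootstrap sample, limited to
# max_depth, with a random subset of features considered at every split.
rf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0).fit(X, y)
print(rf.score(X, y))
```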
Gradient Boosting: The gradient boosting classifier, an ensemble learning technique, sequentially builds weak learners, with each learner rectifying the errors committed by the preceding one [92]:

$$F_m(x) = F_{m-1}(x) + \gamma_m\, h_m(x),$$

where $F_m$ denotes the boosted model at step $m$, $h_m$ signifies the weak learner, and $\gamma_m$ represents the step size. The gradient boosting method has also been utilized for prediction by various authors [93,94,95].
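In the scikit-learn sketch below, learning_rate plays the role of the step size $\gamma_m$, and staged_predict exposes the model $F_m$ after each boosting step; the data and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                 random_state=0).fit(X, y)
# Accuracy of the boosted model F_m after the first and the last stage.
staged_acc = [(pred == y).mean() for pred in gbc.staged_predict(X)]
print(staged_acc[0], staged_acc[-1])
```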
AdaBoost: The adaptive boosting (AdaBoost) classifier, an adaptive boosting technique, combines weak learners to form a robust classifier, with each learner weighted based on its accuracy [96]. The algorithm updates the weights of the training instances at each iteration, assigning greater importance to misclassified instances:

$$w_i^{(t+1)} = \frac{w_i^{(t)} \exp\left( -\alpha_t\, y_i\, h_t(x_i) \right)}{Z_t},$$

where $w_i^{(t)}$ is the weight of instance $i$ at iteration $t$, $h_t(x_i)$ is the prediction, $y_i$ is the true label, $\alpha_t$ is the weight of the weak learner, and $Z_t$ is the normalization factor.
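A single reweighting step can be reproduced in a few lines of NumPy, assuming labels and predictions coded in $\{-1, +1\}$ and the usual choice $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ for the weak-learner weight.

```python
import numpy as np

def adaboost_weight_update(w, y, pred, alpha):
    """One AdaBoost reweighting step: w_i <- w_i * exp(-alpha * y_i * h(x_i)) / Z."""
    w_new = w * np.exp(-alpha * y * pred)        # labels and predictions in {-1, +1}
    return w_new / w_new.sum()                   # Z_t normalizes the weights to sum to 1

w = np.full(6, 1 / 6)                            # uniform initial weights
y = np.array([+1, +1, -1, -1, +1, -1])           # true labels
pred = np.array([+1, -1, -1, +1, +1, -1])        # weak-learner predictions (two mistakes)
eps = w[y != pred].sum()                         # weighted error of the weak learner
alpha = 0.5 * np.log((1 - eps) / eps)            # weak-learner weight alpha_t
print(adaboost_weight_update(w, y, pred, alpha).round(3))   # misclassified instances gain weight
```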
Gaussian Process: The Gaussian process classifier, a Bayesian, non-parametric model, employs a Gaussian process prior over the function space and yields probabilistic classification results [97]. It is determined by a mean function $m(x)$ and a covariance function $k(x, x')$:

$$f(x) \sim \mathcal{GP}\left( m(x),\, k(x, x') \right).$$
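A scikit-learn usage sketch follows, with an RBF kernel standing in for the covariance function $k(x, x')$ (the prior mean is zero in this implementation); the data and kernel parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0).fit(X, y)
print(gpc.predict_proba(X[:3]).round(3))         # probabilistic class predictions
```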
XGBoost: The extreme gradient boosting (XGBoost) algorithm is a highly efficient and scalable tree-based ensemble learning model, designed for both classification and forecasting problems [98]. It is an extension of the gradient boosting algorithm, employing advanced regularization techniques to improve the model’s generalization and control overfitting. XGBoost optimizes the following objective function:

$$\mathcal{L}(\theta) = \sum_{i} l\left( y_i, \hat{y}_i \right) + \sum_{j} \Omega\left( f_j \right),$$

where $\theta$ represents the model parameters, $l(y_i, \hat{y}_i)$ denotes the loss function comparing the true label $y_i$ and the forecasted label $\hat{y}_i$, and $\Omega(f_j)$ is the regularization term for the $j$-th tree. The regularization term comprises the tree complexity, measured by the number of leaves $T$, and the $\ell_2$-norm of the leaf scores $w$:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert_2^2.$$
The algorithm employs second-order gradient information (Hessian) as well as the first-order gradient for updating the model, making the learning process more accurate and faster. Furthermore, it utilizes column-block and sparsity-aware techniques to handle sparse data and parallelize the tree construction process, enabling it to tackle large-scale datasets efficiently [99].
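A usage sketch with the third-party xgboost package (through its scikit-learn-style interface) is shown below; the parameter values are illustrative, with reg_lambda and gamma corresponding to the $\lambda$ and $\gamma$ penalties in the regularization term above.

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# reg_lambda is the L2 penalty on the leaf scores w; gamma penalizes each additional leaf.
xgb = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1,
                    reg_lambda=1.0, gamma=0.0).fit(X, y)
print(xgb.score(X, y))
```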
LightGBM: The light gradient boosting machine (LightGBM), a boosting framework, leverages tree-based learning algorithms and is designed to be efficient and scalable for large datasets [100]. It adopts gradient-based one-side sampling and exclusive feature bundling to expedite training and diminish memory usage.
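A usage sketch with the third-party lightgbm package is shown below; exclusive feature bundling is applied internally, gradient-based one-side sampling can be enabled through LightGBM's parameters, and the values used here are illustrative.

```python
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=0).fit(X, y)
print(lgbm.score(X, y))                          # training accuracy
```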
In the next section, the results of applying the proposed method are presented. First, the results of using different classifiers with window sizes of 10, 50, and 100 records are reported. Then, the incorporation of the Rocket, MiniRocket, and MultiRocket models with 10, 50, and 100 time steps is evaluated. Finally, the use of EMD methods to reduce non-significant noise is evaluated.