1. Introduction
Power transformers are key components for the proper functioning of transmission and distribution grids. Although transformers are very reliable assets, the early detection of incipient degradation mechanisms is very important to prevent failures that may shorten their life span [1,2]. The life-cycle management of power transformers comprises several stages, such as transformer specification, erection, commissioning, operation, maintenance, and end-of-life operations. For the last two stages in particular, it is of paramount importance to have suitable tools for assessing power transformer condition. The economic consequences of a catastrophic power transformer failure comprise: i. the costs of the lost transmission of electricity, and ii. the direct costs of the power transformer, which vary according to the electrical system, substation topology, and technical characteristics of the transformer. For example, consider the loss of transmission capability due to the failure of a single-phase, 230 kV, 33 MVA unit located somewhere in Mexico. The economic impact is composed of i. the costs for the loss of transmission, which amount to 6,177,600 USD (since the cost of lost transmission is around 2.6 USD/kWh in Mexico), and ii. the direct costs over a 72-hour affectation window (extinguishing fires, repairing damaged facilities, soil remediation operations, reserve transformer testing and commissioning, and restoring all substation and system conditions), around 1,280,000 USD. Therefore, grid operators and utilities are in dire need of tools that allow them to optimize their decision-making processes regarding transformer repair, refurbishment, or replacement, under the umbrella of cost, reliability, and safety optimization [3,4].
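As a sanity check, the loss-of-transmission figure follows directly from the stated parameters, assuming the 33 MVA unit would otherwise be fully loaded during the entire 72-hour affectation window:

$$ 33{,}000\ \text{kW} \times 72\ \text{h} \times 2.6\ \tfrac{\text{USD}}{\text{kWh}} = 6{,}177{,}600\ \text{USD}. $$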
Condition Assessment (CA) is the process of identifying markers and indexes to determine and quantify the degradation level of transformer components [1,5,6]. Power transformer CA strategies include exhaustive electrical and physicochemical testing, the usage of online and/or offline techniques, the analysis of operation and maintenance parameters, and the use of condition-based strategies supported by current standards and expert knowledge. In fact, expert assessment is the most effective, but also the most costly and time-consuming, CA strategy: it requires taking transformers offline and hiring experts to carry out the analysis. Thus, utilities are looking for more cost-effective CA strategies that require little or no expert intervention.
One of the main steps of transformer CA is the identification of faults by a Transformers Fault Diagnosis (TFD) procedure. TFD focuses on the insulation system, whose integrity is directly related to transformer reliability [7]. The insulating system is exposed to electrical, mechanical, and thermal stresses. These phenomena are considered normal if they were accounted for in the transformer design; otherwise, they are considered abnormal. Among the abnormal behaviors stand emergency overloading, arc flashes, transient events, and thermal faults, to mention a few [8,9]. The transformer insulation system is divided into the insulating fluid (commonly mineral oil) and the solid insulation (kraft paper, pressboard, and other materials). Oil plays a very important role, providing highly reliable insulation and working as an efficient coolant that removes the heat generated at the core and windings during transformer operation [10]. Further, the insulating oil can provide important information regarding transformer degradation and behavior at a very low cost, eliminating the need for expensive offline testing.
Transformer insulating oil is a petroleum-derived liquid that can be based on isoparaffinic, naphthenic, naphthenic-aromatic, and aromatic hydrocarbons. Regardless of its structure, insulating oil can be decomposed by abnormal stresses, producing dissolved byproduct gases correlated to specific faults. Hence, Dissolved Gas Analysis (DGA) is a widely studied diagnostic technique for which many tools are already available. These are based on the analysis of each byproduct gas, its concentration, and the interrelationships between them. Among the most classical methods to diagnose oil samples stand the Rogers ratios, IEC ratios, Dornenburg ratios, Key Gas method, Duval Triangles [2,10,11,12,13,14], and Duval Pentagons [3,15], to mention a few. Most of these methods are based on dissolved gas ratio intervals that classify transformers into different faults. However, they are prone to misinterpretation near the fault boundaries [13,14]. Furthermore, classical DGA methods always identify a fault even when there is none; thus, expert assessment is still required to accurately determine whether a fault exists. On the other hand, coarse DGA-based fault classification methods have a high accuracy rate but poor usability, whereas fine TFD can be used for decision-making but its accuracy rate is lower [13]. In general, deciding whether to remove, repair, or replace a transformer in the presence of thermal faults requires determining the fault severity [15]; thus, finer TFD is preferred. An important avenue for TFD methods is Machine Learning (ML). These data-based algorithms have been proposed to improve TFD performance while avoiding the drawbacks mentioned earlier. ML methods provide high flexibility: they are able to handle linear and non-linear relations, are robust to noise, do not necessarily require taking the thermodynamic phenomena into account, and provide high fault diagnosis performance [16]. The ML algorithms that have been used for the TFD endeavor can be divided into supervised and unsupervised approaches. Supervised ML employs different gas-in-oil ratios, already diagnosed by experts or chromatography, to build a function that relates these gas ratios to transformer faults or normal/faulty status. Unsupervised ML employs dissolved gases data to cluster transformers into groups whose gas ratios are similar to each other. Nevertheless, expert diagnosis is always required to assess the performance of the models; thus, this study highlights the supervised approach. Most of the ML works applied to the TFD problem cover one or more of the following steps:
ML algorithms: several classifiers have been used, such as Artificial Neural Networks (ANN) [1,10,11,13,17], expert-guided ANN [13,18], Bayes Networks [1], Decision Trees (DT) [1,11], Extreme Learning Machines (ELM) [13], K-Nearest Neighbors (KNN) [1,10,11,17], Logical Analysis of Data (LAD) [19], Logistic (LR) and regularized (LASSO) regression [11], Probabilistic Neural Networks (PNN) [10,20], Softmax Regression (SR) [13], and Support Vector Machines (SVM) [11,17]; ensembles such as Boosting and Bagging [1], eXtreme Gradient Boosting (XGBoost) [21], and Stacked ANN [2]; and even state-of-the-art algorithms such as few-shot learning with belief functions [4].
Data pre-processing methods: several data transformations have been used, such as data binarization [19], key [1,2,4,11,13,20,21] and custom [2] gas ratios, logarithmic transformation [1,2,11], mean subtraction, normalization [2], and standardization [1,2,11]; imputation of missing values using simple approaches [22]; dimensionality reduction such as Linear Discriminant and Principal Components Analyses [10] and belief functions [4]; feature extraction such as Genetic Programming [17]; and knowledge-based transformations such as expert knowledge rules [13] and oil-gas thermodynamics [2].
TFD approach: the TFD problem has been posed as a binary classification (normal or faulty transformer) [1,2,11,21]; as a multi-class classification with coarse [17,19] and fine [2,10,13,20,21] fault types; and even diagnosing fault severity [4,10].
Classes imbalance: several balancing strategies have been used, such as bootstrap [17], re-sampling the minority classes [11,13], classes subsetting [11,13], and assigning weight coefficients to the minority class [21].
Parameters optimization: algorithm parameters have been optimized by hand [2] and by trial-and-error [10]; by exact methods such as Grid Search (GS) [11,17] and Mixed Integer Linear Programming [19]; by Bayesian methods [21]; and by metaheuristics such as the Bat algorithm [20] and the Genetic and Mind Evolutionary algorithms [13].
Model overfitting assessment: overfitting of TFD classifiers has been handled through the usage of classical [1,11,17] and stratified Cross-Validation (s-CV) [21].
TFD performance: algorithm performance has been determined using the accuracy percentage [1,2,4,10,11,13,17,19,20], the confusion matrix [2,10,17], the Area Under the Receiver Operating Characteristic (ROC) [1,2,11] and Precision-Recall (PR) [21] Curves, and the micro and macro F1-measure [21].
Although many works have delved into the usage of ML algorithms for the TFD problem, they present one or more shortfalls, such as i) training and testing their methods on small datasets, ii) carrying out comparisons using only classical supervised ML algorithms, iii) considering only coarse fault types and setting aside fault severity (not to mention that none of the reviewed works considered fault severity as defined by the Duval Pentagon method), and iv) the lack of publicly available data. These issues prevent obtaining a clear idea of which sequence of methods and algorithms provides the best performance for the TFD problem, undermine the reproducibility of the research results, and hamper the deployment of ML solutions to the TFD problem of real-world utilities.
The construction of high-performance ML pipelines, regardless of the application, requires the involvement of data scientists and domain experts. This synergy allows incorporating domain knowledge into the design of specialized ML pipelines (i.e., the sequence of data pre-processing, domain-driven feature selection and engineering, and optimized ML models for a given problem [23]). However, constructing specialized ML pipelines this way is a long, expensive, complex, and iterative trial-and-error task. This analysis (and the related works) shows how difficult it is for operational process experts to build intelligent models. These power systems experts can easily be overrun by the selection and combination of an ever-growing set of alternatives of pre-processing methods, ML algorithms, and their parameter optimization for the solution of the TFD problem. Under these circumstances, the probability of obtaining a final ML pipeline that behaves sub-optimally is higher [23,24]. Hence, there is a growing need to provide power systems technicians with ML tools that can be applied straightforwardly to power systems problems (e.g., TFD). The approaches for automatically (without human intervention) and simultaneously obtaining a high-performing combination of data pre-processing, learning algorithm(s), and set of hyperparameters are branded as Automatic Machine Learning (autoML) [24,25]. AutoML is a promising approach that may be used off-the-shelf for the TFD problem of real-world industries.
Therefore, this work presents a deep comparative analysis of a large pool of supervised ML algorithms, composed of single, ensemble, and autoML classifiers, applied to the TFD problem. The purpose of this review is to compare algorithm performance for the TFD problem under the same experimental settings by: i) compiling and sharing a Transformers Database (TDb) with the main dissolved gases data of 821 transformers and their corresponding diagnostics; ii) using single and ensemble ML algorithms, as well as state-of-the-art autoML frameworks, to solve the TFD problem; and iii) solving a real-world TFD multi-class classification problem using, for the first time (to the best of the authors' knowledge), Duval Pentagon fault and severity classes [26]. In doing so, this review provides a deeper comprehension of the ML approaches available for the TFD problem, and a view on how much automation one can expect for the TFD problem, particularly when fault severity is taken into consideration.
2. Materials and Methods
The overall ML system followed by the present work for the multi-class TFD problem is presented in Figure 1. For the purpose of comparison, the pipelines used for single and ensemble classifiers were branded as Standard ML Framework, whereas the pipeline used for autoML was branded as AutoML Framework. Either way, a shared pipeline is specified for both ML approaches. The overall ML system consists of five major sections:

Data collection and labeling, where transformers' dissolved gas-in-oil data and their corresponding diagnostics were collected. Diagnostics were double-checked: first, the Duval Pentagons method was applied to obtain the fault severity (if not available); then, the IEEE C57.104-2019 standard and expert validation were used to identify normally operating transformers.

Initial pre-processing, where the gas-in-oil information was initially pre-processed following several methods studied in the literature, namely the replacement of zero measurements, natural logarithm scaling, and the derivation of key gas ratios.

Splitting the data into training (i.e., Xtrain and Ytrain) and testing (i.e., Xtest and Ytest) datasets. This splitting took each class proportion into consideration, to avoid leaving classes unrepresented in either dataset.

Training the ML system, through either:

Standard ML framework, where a second data pre-processing stage, training, and parameter optimization were carried out. Parameters of both single and ensemble classifiers were optimized through Grid Search (GS) and Cross-Validation (CV) procedures.

AutoML framework, where a warm-start procedure, additional data and feature pre-processing methods, and classifier optimization and ensemble construction were carried out automatically.

Measuring the test error, where the algorithms are comprehensively evaluated using several multi-class performance measures, such as the κ score, balanced accuracy, and the micro and macro F1-measure.
2.1. DGA Data
A Transformers Database (TDb) comprising 821 transformers was gathered from different bibliographic sources. These samples were obtained from the specialized literature: a Db from the International Council on Large Electric Systems (CIGRE), a Db from the IEEE [27], technical papers [15,28,29,30,31,32], a CIGRE technical brochure [33], and expert curation. For each transformer, the five so-called thermal hydrocarbon gases and, when reported, the corresponding diagnostic were collected. The collected gases are hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), and acetylene (C2H2). When available, the associated diagnostic was also recovered from the bibliographic sources; otherwise, it was obtained by means of an analysis method. In this paper, the Duval Pentagons method [15,26] is selected because it offers not only fault types but also the severity of thermal faults. It is important to notice that, in some cases, this analysis method was also used to confirm the diagnostic provided in the literature.
The Duval Pentagons method [26] first calculates the relative percentage ratios by dividing the concentration of each gas by the Total Gas Content (TGC). Then, the five relative percentage gas ratios are plotted on their corresponding axes of the Duval Pentagon, yielding an irregular five-sided polygon. The centroid of this irregular polygon provides the first part of the diagnostic, depending on the pentagon region where it is located. The diagnostic faults available in the first Duval Pentagon are Partial Discharges (PD), low- and high-energy discharges (D1 and D2, respectively), thermal faults involving temperatures below 300°C (T1), thermal faults with temperatures ranging from 300 to 700°C (T2), and thermal faults involving temperatures above 700°C (T3). There is an additional region in the first Pentagon called the Stray Gassing region (S), which reveals another type of gas generation mechanism. Stray gassing is associated with relatively low temperatures, oxygen presence, and the chemical instability of oil molecules caused by a previous hydrogen treatment whose scope is the removal of impurities and undesirable chemical structures in mineral oils. The second part of the Duval Pentagons method allows the user to refine the diagnostic of each gas vector, providing advanced thermal diagnostic options: high-temperature thermal faults occurring in oil only (T3-H), different-temperature thermal faults involving paper carbonization (T1-C, T2-C, and T3-C), and overheating (T1-O).
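To make the geometric step concrete, the sketch below computes the relative percentage ratios and the centroid of the resulting polygon. The axis ordering and angles, as well as the sample values, are assumptions for illustration; the final lookup of the fault region containing the centroid (which requires the published region boundaries of [26]) is omitted.

```python
import math

# Axis angles (degrees) for the five gases, counterclockwise with H2 at
# the top. NOTE: this ordering is an assumption for illustration only;
# consult the Duval Pentagon definition [26] for the exact axis layout.
AXES = {"H2": 90, "C2H6": 162, "CH4": 234, "C2H4": 306, "C2H2": 18}

def pentagon_centroid(gases):
    """gases: dict of gas -> concentration in ppm (the five pentagon gases)."""
    tgc = sum(gases.values())                             # Total Gas Content
    pct = {g: 100.0 * v / tgc for g, v in gases.items()}  # relative % ratios
    # Vertices of the irregular five-sided polygon, in axis order.
    pts = [(pct[g] * math.cos(math.radians(a)),
            pct[g] * math.sin(math.radians(a))) for g, a in AXES.items()]
    # Standard signed-area centroid of a polygon (shoelace formula).
    area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        cross = x0 * y1 - x1 * y0
        area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area *= 0.5
    return cx / (6 * area), cy / (6 * area)

# Hypothetical sample; the diagnostic is the Pentagon region containing
# this centroid (region lookup not shown).
print(pentagon_centroid({"H2": 50, "CH4": 100, "C2H6": 30, "C2H4": 200, "C2H2": 5}))
```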
However, all the available classical TFD methods (including the Duval Pentagons method) always provide a diagnostic, even when the gas concentrations are too low to be meaningful. To avoid such false positives, the IEEE C57.104-2019 standard [34] along with expert experience were used to tag the corresponding transformers with a normal condition diagnostic. The resulting class distribution for the available dataset is shown in Table 1.
2.2. Initial Pre-processing of DGA Data
Before any TFD can be carried out, by either the standard ML or the autoML framework, the TDb requires initial pre-processing. This pre-processing stage consisted of three steps: (i) the replacement of zero measurements; (ii) the scaling of measurement values using the natural logarithm function; and (iii) the derivation of features from dissolved gases ratios. The main reasons for carrying out an initial data pre-processing stage are two-fold. On one hand, data pre-processing methods do improve the performance of standard ML frameworks for the TFD problem [1,2,4,11,13,20,21]. On the other hand, autoML frameworks have paid more attention to the selection of ML models and their HPO than to feature engineering (i.e., creation) and data pre-processing methods [23,35]. Furthermore, the autoML algorithm selected for this review considers neither the pre-processing methods used in the proposed pipeline nor a feature engineering method that can derive dissolved gases ratios from TDb sample measurements.
The initial pre-processing of DGA data is as follows. First, gas measurements whose reported values were zero were considered to be below the detection limits of the chemical analysis procedure. Thus, zero measurements were replaced by a small constant value for mathematical convenience (i.e., 1); particularly, for C2H2 a smaller constant was used (i.e., 0.1). Second, gas values were scaled using the ln function. This process is generally suggested for scaling features with positively skewed (i.e., heavy-tailed) distributions, since it both improves their normality and reduces their variances [36]. Third, feature engineering consisting in the estimation of different ratios from the transformed gas values was carried out. The relationship between fault types and the proportions of dissolved gases in the insulating system has been exploited by traditional DGA methods [7,21,28]. Therefore, several relative ratios based on CH4, C2H6, C2H4, C2H2, and H2 were derived. These are shown in Table 2. In this Table, THC (Total Hydrocarbon Content) stands for the sum of the hydrocarbon gas contents, whereas TGC (Total Gas Content) stands for the total amount of dissolved gas contents in the transformer oil.
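A minimal pandas sketch of these three steps is shown below. The column names and sample values are assumptions, only a few representative ratios are shown (the full set is defined in Table 2), and computing the ratios from the zero-replaced raw values is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical TDb excerpt; column names and values are illustrative.
tdb = pd.DataFrame({"H2": [50, 0], "CH4": [100, 12], "C2H6": [30, 0],
                    "C2H4": [200, 7], "C2H2": [0, 3]})

# (i) Replace zero measurements (below detection limit): 1 ppm for all
# gases, 0.1 ppm for C2H2, as described above.
for gas in ["H2", "CH4", "C2H6", "C2H4"]:
    tdb[gas] = tdb[gas].replace(0, 1.0)
tdb["C2H2"] = tdb["C2H2"].replace(0, 0.1)

# Totals used by the ratio features (THC excludes H2; TGC includes it).
thc = tdb[["CH4", "C2H6", "C2H4", "C2H2"]].sum(axis=1)
tgc = thc + tdb["H2"]

# (iii) A few representative ratio features (full set: Table 2).
tdb["CH4/H2"] = tdb["CH4"] / tdb["H2"]
tdb["C2H2/C2H4"] = tdb["C2H2"] / tdb["C2H4"]
tdb["CH4/TGC"] = tdb["CH4"] / tgc
tdb["C2H4/THC"] = tdb["C2H4"] / thc

# (ii) ln scaling of the gas measurements.
for gas in ["H2", "CH4", "C2H6", "C2H4", "C2H2"]:
    tdb[gas] = np.log(tdb[gas])
```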
2.3. Splitting data and Training the ML system
Once the data is initially prepared, it is split into training and testing datasets. This splitting was carried out taking class proportions into consideration; thus, each fault type is represented in both the training (Xtrain, Ytrain) and testing (Xtest, Ytest) datasets. The proportions used for splitting the TDb were 70% for training and 30% for testing. Both subsets kept the same class distribution ratios as the full TDb, in order to assess classifier performance with imbalanced datasets. Afterwards, the ML systems were trained. Before delving into the details of both ML frameworks, i.e., standard and autoML, it is worth highlighting that a second stage of data pre-processing was considered for convenience; this avoids carrying out the same data pre-processing method (i.e., standardization) twice by the autoML approaches.
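A sketch of this stratified 70/30 split with scikit-learn; the variable names and the random seed are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# X: feature matrix (gases + ratio features), y: fault labels; both are
# hypothetical names for the pre-processed TDb described above.
X, y = tdb_features, tdb_labels

# stratify=y preserves the class distribution of the full TDb in both
# subsets, so no fault type is left unrepresented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)  # seed is arbitrary
```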
2.3.1. Standard ML framework
The standard ML framework follows a classical pipeline: i) data pre-processing, ii) selection of the classifier (either single or ensemble), and iii) optimization of the classifier parameters (using a GS-CV procedure). To complete the data pre-processing treatment, the TDb gas measures are standardized by subtracting their mean and scaling the values by their variance. Next, a classification algorithm is selected, either a single (ANN, DT, Gaussian Processes (GP), Naïve Bayes (NB), KNN, LR, and SVM) or an ensemble algorithm. The main difference between single and ensemble classifiers is that the former seeks a single robust model which attains good generalization, whereas the latter combines several instances of the same classifier. Usually, the classifiers composing the ensemble perform only slightly better than a random classifier (or overfit individually), and good generalization is attained by using different combining strategies. Among the ensemble strategies stand Boosting (Histogram (HGB) and eXtreme (XGBoost) Gradient Boosting), Bagging (Bagging Classifier (BC) and Random Forest (RF)), and Stacking (SE). The stacked ensemble is a particular case, where two or more strong classifiers are sequentially chained; for this study, an ANN followed by an SVM is employed.
Single and ensemble classifiers have been neatly discussed elsewhere; however, for the sake of completeness, they are briefly detailed in Appendix A.1 and Appendix A.2, respectively. On the other hand, Table 3 presents the parameters employed by the single and ensemble classifiers. The optimal values are estimated using a grid search cross-validation procedure with k = 5 folds.
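A sketch of the GS-CV procedure for one single classifier (SVM); the parameter grid is a hypothetical stand-in for the values listed in Table 3.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Standardization + classifier chained so the scaler is re-fitted inside
# each CV fold (no information leaks from the validation folds).
pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Hypothetical grid; the actual parameter ranges are those of Table 3.
param_grid = {"clf__C": [0.1, 1, 10, 100],
              "clf__gamma": ["scale", 0.01, 0.1],
              "clf__kernel": ["rbf", "poly"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)  # 70% training split from above
print(search.best_params_, search.best_score_)
```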
2.3.2. AutoML Problem and Frameworks
In accordance with [23], an ML pipeline h : X → Y involves the computationally intensive and repetitive sequential combination of algorithms that maps any given observation x ∈ X into a discrete (e.g., class) or continuous value y ∈ Y.

To define an ML pipeline h, let us first define its components. The set of specific algorithms, e.g., data pre-processing, feature selection, and classification, is defined as A = {A(1), A(2), . . . , A(n)}. For each algorithm A(i), a configuration vector of hyperparameters λ(i) ∈ Λ(i) is defined. The algorithms in A are connected with each other in accordance with a structure g. Such structure is defined as a Directed Acyclic Graph (DAG), where nodes represent algorithms and edges represent the data flow. The structure has implicit constraints (e.g., an imputation algorithm must precede a classification one); thus, it belongs to a set of valid pipeline structures G, and its length (i.e., the number of consecutive sequential algorithms) is given by |g|. Therefore, an ML pipeline is formally given by P(g, A, λ), where g ∈ G stands for the structure of the valid pipeline, the vector A ∈ A^|g| stands for the algorithms selected for each node, and the vector λ ∈ Λ stands for the hyperparameters of each selected algorithm in the pipeline.
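To ground the notation, the sketch below instantiates a fixed linear structure g (scaler → feature selector → classifier, so |g| = 3), a concrete algorithm choice A for each node, and a hyperparameter vector λ; the particular algorithms and values are illustrative assumptions, not the pipeline used in this work.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# g: a fixed linear DAG with three nodes; A: one algorithm per node;
# lambda: each algorithm's hyperparameters (illustrative values).
pipeline = Pipeline([
    ("scaling", StandardScaler()),                                 # A(1)
    ("selection", SelectKBest(f_classif, k=10)),                   # A(2), lambda(2) = {k: 10}
    ("classification", RandomForestClassifier(n_estimators=200)),  # A(3), lambda(3) = {n_estimators: 200}
])
```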
Therefore, given a problem defined by the i.i.d. samples D = {(x1, y1), . . . , (xm, ym)} drawn from the joint probability distribution P(X, Y), an ML pipeline is created by finding the structure, algorithms, and their corresponding hyperparameters which minimize the Empirical Pipeline Performance (EPP), such as:

$$ P^{*} \in \underset{g \in G,\; A \in \mathcal{A}^{|g|},\; \lambda \in \Lambda}{\arg\min}\; \frac{1}{m} \sum_{i=1}^{m} L\left(h(x_i), y_i\right) \qquad (1) $$

where h(x_i) is the predicted output of the pipeline P, and L is a loss function. Further, to avoid overfitting P, cross-validation is considered. Hence, D is split into k disjoint folds, {D_valid^(1), . . . , D_valid^(k)} and {D_train^(1), . . . , D_train^(k)}. By rewriting Eq. 1 to include these, the final objective function is obtained:

$$ P^{*} \in \underset{g \in G,\; A \in \mathcal{A}^{|g|},\; \lambda \in \Lambda}{\arg\min}\; \frac{1}{k} \sum_{j=1}^{k} \frac{1}{\left|D_{valid}^{(j)}\right|} \sum_{(x_i, y_i) \in D_{valid}^{(j)}} L\left(h^{(j)}(x_i), y_i\right) \qquad (2) $$

where h^(j) denotes the pipeline trained on D_train^(j).
To minimize the cost function presented in Eq. 2, three sub-problems must be addressed altogether: i) the structure search, ii) the algorithms selection, and iii) the algorithms' HyperParameter Optimization (HPO). On one hand, a recent survey [23] states that most autoML frameworks avoid solving the structure search by following a best-practice fixed pipeline structure. This approach removes the burden of determining the graph structure g in Eq. 2. On the other hand, the algorithms selection and their HPO are simultaneously determined by solving the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem [23,24,25]. Solving the CASH problem is similar to solving Eq. 2. If the pipeline structure g is fixed, i.e., |g| = 1, the CASH problem is defined as

$$ A^{*}_{\lambda^{*}} \in \underset{A^{(j)} \in \mathcal{A},\; \lambda \in \Lambda^{(j)}}{\arg\min}\; \frac{1}{k} \sum_{i=1}^{k} L\left(A^{(j)}_{\lambda},\, D_{train}^{(i)},\, D_{valid}^{(i)}\right) \qquad (3) $$

To simultaneously consider which algorithm to use and its corresponding hyperparameters, a selection vector λ_r is defined. This allows mapping the algorithm choice into the Λ configuration space, such that

$$ \Lambda = \Lambda^{(1)} \cup \cdots \cup \Lambda^{(n)} \cup \{\lambda_r\} $$

In consequence, the CASH minimization problem is stated as

$$ \lambda^{*} \in \underset{\lambda \in \Lambda}{\arg\min}\; \frac{1}{k} \sum_{i=1}^{k} L\left(A_{\lambda},\, D_{train}^{(i)},\, D_{valid}^{(i)}\right) \qquad (4) $$

Eq. 4 is a mixed-integer nonlinear optimization problem. Its solution involves finding the algorithms' numerical or categorical hyperparameters, which may be mandatory or conditional (i.e., their value depends on the selection of other hyperparameters) [23].
To solve all of these numerical cruxes, several autoML frameworks based on classical ML and ensembles [23,37], as well as deep learning frameworks [38], have been proposed. Given the infancy of the autoML area, recent reviews reveal that most of the available autoML tools obtain competitive, but similar, results across several ML tasks [23,35]. Therefore, we selected the auto-Sklearn algorithm, which is one of the first autoML frameworks and provides robust, expert-competitive results in several ML tasks [23,25,37]. Auto-Sklearn is built upon the Python scikit-learn (sklearn) library [39]. It is employed for building classification and regression pipelines, searching over classical ML and ensemble models. This algorithm explores semi-fixed structured pipelines by setting an initial fixed set of data cleaning steps. Then, a Sequential Model-based Algorithm Configuration (SMAC) procedure, using Bayesian Optimization in combination with a Random Forest regression, allows the selection and tuning of optional pre-processing and mandatory modeling algorithms. Also, auto-Sklearn provides parallelization features, meta-learning to initialize the optimization procedure, and ensemble learning through the combination of the best pipelines [23,25,37].
To improve the analysis between the standard and autoML frameworks, two autoML versions are considered, namely the vanilla and the robust auto-Sklearn models. The main differences between the two are: i) the vanilla model only keeps a single final model, whereas the robust model employs an ensemble, and ii) the vanilla model does not employ the meta-learning warm-start stage to initialize the optimization procedure, whereas the robust model does. In this sense, the vanilla model serves as a baseline for the autoML framework.
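A hedged sketch of the two configurations, assuming the autosklearn 0.x API (constructor arguments changed in later releases); the time budget is an assumption.

```python
import autosklearn.classification as askl

# "Vanilla": single final pipeline, no meta-learning warm start.
vanilla = askl.AutoSklearnClassifier(
    time_left_for_this_task=3600,               # search budget (assumed)
    ensemble_size=1,                            # keep a single final model
    initial_configurations_via_metalearning=0,  # disable warm start
)

# "Robust": library defaults, i.e., ensembles + meta-learning warm start.
robust = askl.AutoSklearnClassifier(time_left_for_this_task=3600)

vanilla.fit(X_train, y_train)
robust.fit(X_train, y_train)
```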
2.4. Classification Performance Metrics
In order to compare the performance of the standard and autoML algorithms, several multi-class classification metrics are employed. As mentioned before, several classification metrics have been employed for the analysis of algorithm performance for the TFD problem (i.e., the accuracy percentage, the confusion matrix, the Area Under the Receiver Operating Characteristic (AUCROC) and Precision-Recall (AUCPR) Curves, and the micro and macro F1-measure). However, neither the accuracy percentage nor the AUCROC is sensitive to class imbalance. Further, neither the AUCROC nor the AUCPR is suitable for analyzing a multi-class classification problem. Therefore, in this work the Confusion Matrix (CM), the Balanced Accuracy (BA), the F1-measure (F1) using micro and macro averages, Cohen's Kappa (κ), and the Matthews Correlation Coefficient (MCC) are employed.

On one hand, the CM is a tool to understand the errors of classifiers in binary, multi-class, and even multi-label scenarios. On the other hand, the remaining performance metrics used in this work are derived from it. The selected metrics are useful for assessing the overall performance of a classifier in a multi-class problem. Among these, MCC and κ (and, to a lesser extent, F1-macro) are more robust than the rest for assessing the expected performance of classifiers in the presence of class imbalance.
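All five metrics are available in scikit-learn; a short sketch, reusing the fitted model and the test split from the previous steps:

```python
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, matthews_corrcoef)

# y_test: ground-truth fault labels; y_pred: a fitted model's predictions.
y_pred = robust.predict(X_test)

print("BA:      ", balanced_accuracy_score(y_test, y_pred))
print("F1-micro:", f1_score(y_test, y_pred, average="micro"))
print("F1-macro:", f1_score(y_test, y_pred, average="macro"))
print("kappa:   ", cohen_kappa_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true classes, cols: predicted
```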
2.5. Software
All the experimentation required for the TFD ML algorithm comparison, i.e., pre-processing, training, and testing, was conducted using the Python programming language in a Jupyter notebook. Standard Python packages such as numpy [40] and pandas [41] were used for the initial pre-processing stages. For training the classical and most of the ensemble ML algorithms, the sklearn [39] package is employed (in the case of XGBoost, the xgboost [42] package is used). For the autoML case, the autosklearn package is used [25]. The notebook is available in a GitHub repository.

It is worth noting that, while it would be a good idea to use the MCC or κ as the cost function for training the algorithms, due to sklearn package limitations the training cost function is restricted to the F1-macro.
3. Results
This section presents the TFD classification results obtained for the algorithms of both the standard and the autoML frameworks. For each classifier, five (5) performance metrics are calculated (as described in the previous section). Using these metrics, a quantitative comparative analysis is carried out to determine the best algorithm(s). To analyze the performance of the remaining algorithms in more depth, a Multi-Objective Decision Making (MODM) comparison is carried out. Afterwards, class imbalance, false positives, and false negatives of the best performing algorithm are analyzed through the CM.
3.1. Overall classifiers performance for the TFD problem
Table 4 presents the results of the standard and autoML frameworks for the five quality metrics, with the best performing solutions highlighted in bold. Observe that, in general, the best performing algorithm across the five quality metrics is the robust auto-Sklearn model. This model outperformed the rest of the algorithms, particularly for the F1-macro measure, where the closest competitors (the ANN and SE models) attained approximately 10% lower F1-macro scores. These results show the ability of the robust auto-Sklearn model to handle the imbalanced TDb, providing the highest classification performance among all the tested algorithms with minimum tuning effort by the human-in-the-loop (i.e., the electrical experts carrying out a TFD). Therefore, the robust auto-Sklearn model should be preferred as an off-the-shelf solution for the TFD problem.
However, the performance differences among the remaining algorithms are less evident for metrics such as BA, κ, F1, and MCC. Therefore, to improve the performance comparison, the metric results for each algorithm are transformed using the vanilla auto-Sklearn result as a baseline, as follows:

$$ N_i(A) = \frac{M_i(A) - M_i(A_{vanilla})}{M_i(A_{vanilla})} $$

where M_i(A) corresponds to the result of metric i for algorithm A, M_i(A_vanilla) corresponds to the result of metric i for the vanilla auto-Sklearn model, and N_i(A) corresponds to the baseline-transformed value for metric i and algorithm A. For instance, for BA and the ANN, the baseline-transformed value is obtained as $\tilde{BA}(ANN) = (BA(ANN) - BA(A_{vanilla})) / BA(A_{vanilla})$. The transformed values can be interpreted as follows: a value N_i(A) > 0 implies that algorithm A performs better than the vanilla auto-Sklearn algorithm; in contrast, if N_i(A) < 0, then algorithm A performs worse than the vanilla auto-Sklearn algorithm.
Once the metric values are transformed, a MODM comparison is carried out. MODM deals with problems where two or more performance criteria are used altogether to make a decision: in our case, looking for the algorithm which is able to identify specific electric transformer faults as accurately as possible, in terms of five performance metrics. In MODM, model quality is defined by an n-dimensional vector, where n corresponds to the number of metrics used. Hence, an algorithm solving a MODM problem must consider either a way to collapse the vector of quality metrics into a single scalar, or a way to handle the multiple objective functions all at once.
Among the methods that handle multiple objective functions at once stands the Pareto Approach (PA) [43]. Instead of carrying the burden of collapsing multiple metrics into a single value, the PA looks for a set of solutions (e.g., TFD classification algorithms) that are non-dominated. To define this concept, it is easier to first define its opposite, dominance. A solution s_i is said to dominate s_j iff s_i is strictly better than s_j in at least one of the quality metrics c_k, k = 1, . . . , n, and equal or better in the remaining metrics. Formally, this is i) ∃k | c_k(s_i) > c_k(s_j), and ii) ∀k | c_k(s_i) ≥ c_k(s_j) (where c_k(s_i) stands for the value of quality metric k for solution s_i) [43]. On the other hand, two solutions s_i and s_j are said to be non-dominated with respect to each other iff: i) s_i is strictly better than s_j in at least one of the metrics c_k, k = 1, . . . , n, and ii) s_i is strictly worse than s_j in at least one of the metrics c_k, k = 1, . . . , n. The set of non-dominated solutions is also known as the Pareto frontier. Figure 2 shows the Pareto analysis carried out on the vanilla-baseline transformed quality metrics, excluding the robust auto-Sklearn model.
Figure 2. Models fault classification performance.
Observe that the vanilla auto-Sklearn model lies at the origin (0,0); the algorithms on the Pareto frontier are depicted in red, whereas the worst performing algorithms are displayed in blue. From this figure, note that the SE, ANN, and GP algorithms performed better than the vanilla auto-Sklearn (for BA the improvements were 3%, 3%, and -10%, respectively, whereas for κ the improvements were 3%, 2%, and 4.5%, respectively). Hence, and without considering the robust auto-Sklearn algorithm, any of these can be selected for the TFD problem. On the other hand, HGB and SVM, while performing better than the vanilla auto-Sklearn for the κ metric (3% and 2%, respectively), could be considered as good as the vanilla auto-Sklearn model in a Pareto-front sense (and, to a lesser extent, so could RF). The remaining algorithms should be considered to perform worse than the vanilla auto-Sklearn model. Specifically, the LR and NB algorithms performed considerably worse than the vanilla model: 17% and 14% worse for the BA metric, and 5% and 14% worse for the κ metric, respectively. In summary, autoML frameworks provide a good identification of transformer faults with minimal human intervention; still, classical ML approaches such as the ANN, SE, or GP classifiers would provide better results for the TFD problem.
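The non-dominated filtering behind Figure 2 is straightforward to implement; a small sketch over the baseline-transformed metric vectors follows, where the (BA, κ) values are hypothetical stand-ins chosen for illustration, not the paper's measured results.

```python
def dominates(a, b):
    """True iff solution a dominates b: >= in every metric, > in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(solutions):
    """solutions: dict of name -> tuple of (maximized) metric values."""
    return {n: v for n, v in solutions.items()
            if not any(dominates(w, v) for m, w in solutions.items() if m != n)}

# Hypothetical baseline-transformed (BA, kappa) values for illustration.
models = {"SE": (0.030, 0.030), "ANN": (0.031, 0.020), "GP": (-0.10, 0.045),
          "HGB": (-0.02, 0.030), "LR": (-0.17, -0.05)}
print(pareto_frontier(models))  # -> the non-dominated set (Pareto frontier)
```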
3.2. Transformers fault diagnosis in detail
In accordance with the above results, the overall best performing algorithm for the TFD problem is the robust auto-Sklearn algorithm. However, how did it perform for each transformer fault type? And how does its performance compare against one of the algorithms on the Pareto frontier, such as the SE algorithm? Figure 3 presents the confusion matrices for both algorithms: Figure 3a shows the robust auto-Sklearn, whereas Figure 3b displays the SE. Observe that, in general, both algorithms identified most fault types with good (≥ 80%) to very good (≥ 90%) accuracy, except: in (a), PD and S, with accuracies of 71% and 78%, respectively; in (b), S, T2-C, and T3-C, with accuracies of 78%, 71%, and 75%, respectively. To examine the moderate performance on these fault types, it is useful to recall that, when analyzing the performance of an algorithm using the multi-class CM (see Appendix B.1), rows indicate FN and columns indicate FP, respectively. Thus, for the robust auto-Sklearn algorithm, PD faults were misclassified 29% of the time as S fault types; S faults were misclassified 19% of the time as T1-O faults and 3.7% as Normal condition. For the SE algorithm, S faults were misclassified 22% of the time as T1-O faults; T2-C faults were misclassified 14% of the time as T1-O and 14% as T1-C; T3-C faults were misclassified 12% of the time as T3-H and 12% as S. Of all these errors, the robust auto-Sklearn algorithm incurs the most expensive ones (i.e., classifying a fault as a normal condition). Further, the misclassifications of both algorithms can be attributed to the fault regions they describe for each fault type: these do not necessarily match the Duval Pentagon fault regions, which are geometrically contiguous and do not overlap [15]. In addition, recall that all of these classes, i.e., PD, S, T2-C, and T3-C, are underrepresented in the TDb (see Table 1). In light of this, we can conclude that the misclassified samples may lie at the class limits, and/or the class boundaries found by the algorithms have a different geometric shape than the one defined by the Duval Pentagon. Therefore, increasing the sample size of the imbalanced classes (with either real or synthetic samples) should be useful for improving the boundaries defined in the feature space for each class by both algorithms. Finally, it is worth noting that both algorithms classified the low-temperature thermal faults involving paper carbonization (i.e., T1-C) with 100% accuracy, even though this is the most underrepresented class in the TDb.
4. Conclusions
This paper presents a review and comparative analysis of classical Machine Learning algorithms (single and ensemble classification algorithms) and two automatic machine learning classifiers for the fault diagnosis of power transformers. The purpose of this work is to compare, under the same experimental settings, the performance of classical ML classification algorithms, which require a human expert in the loop for tuning, against two autoML approaches which require almost zero human operation. For such purposes, transformer fault data was gathered from the literature and from the databases of Mexican and foreign utilities and test laboratories. Then, the raw data was curated; specifically, faults were validated and assigned using both the Duval Pentagon method and expert knowledge. The methodology used for the comparison included: i) several pre-processing steps for feature engineering and data normalization; ii) different ML approaches (single and ensemble ML algorithms were trained and tuned using a GS-CV by a data scientist, whereas the autoML models were trained and tuned using Bayesian optimization in combination with a Random Forest regression, with zero human intervention); and iii) several algorithm performance analyses using global metrics, a Pareto front analysis, and a CM to take a detailed look into the type of biases the algorithms suffer. A key contribution of this work is that it defines, for the first time (to the best of the authors' knowledge), fault classes using Duval Pentagon fault and severity classes.
Figure 3. Confusion matrices for (a) the robust auto-Sklearn model and (b) the stacking ensemble algorithm.
The results showed that the robust auto-Sklearn achieved the best global performance metrics over both single and ensemble ML algorithms. On the other hand, the PA showed that the vanilla autoML approach performed worse than some single (ANN, SVM, and GP) and ensemble (SE, HGB, and RF) ML algorithms. Moreover, the CM revealed that, while the robust auto-Sklearn algorithm obtained the highest global performance metric values, it misclassifies some transformer faults as a normal condition. This type of error can have a very negative impact on power grid performance (blackouts), with high economic costs. Nevertheless, the misclassification can be attributed to the imbalanced TDb; increasing the sample size of the imbalanced classes (with either real or synthetic samples) should be useful for improving the boundaries defined in the feature space for each class. In conclusion, the robust auto-Sklearn model is not only a good off-the-shelf solution for TFD while handling imbalanced datasets, but it also achieved the highest global classification performance scores with minimum tuning effort by a human (i.e., the electrical experts carrying out a fault diagnosis). As future work, this model will be incorporated into the power transformer condition assessment module of a maintenance management system. It is expected that failure classification indicating the most probable defect will help engineers reduce the time needed to find and repair incipient faults, avoiding catastrophic failures and fires.