1. Introduction
The classification problem is an NP-hard problem that has many applications in medicine, industry, economics, and other fields. Several types of classifiers have been proposed in the literature to solve this problem, including the approach called the Support Vector Machine, based on Quadratic Programming (QP) [5,6,7]. The difficulty in implementing SVMs on massive data lies in the fact that the amount of computer storage required by a regular QP solver grows exponentially as the problem size expands.
In this paper, we introduce a new version of the SVM that implements a preprocessing filter and a recurrent neural network, namely the Optimal Recurrent Neural Network Density Based Support Vector Machine (Opt-RNN-DBSVM).
SVM approaches are based on the existence of a linear separator, which can be obtained by embedding the data in a higher-dimensional space through appropriate kernel functions. Among all possible hyperplanes, the SVM searches for the one with the most confident separation margin, for good generalization. This issue takes the form of a nonlinear constrained optimization problem that is usually handled using optimization methods. Thanks to the Karush-Kuhn-Tucker conditions, all these methods pass to the dual version and call optimization methods to find the support vectors on which the optimal margin is built. Unfortunately, the complexity in time and memory grows exponentially with the size of the data set; in addition, the number of local minima grows too, which influences the location of the separation margin and the quality of the predictions.
A primary research direction in learning from empirical data through support vector machines (SVMs), for both classification and regression problems, is the development of incremental learning designs for cases where the training dataset is massive \cite{SMOTE}.
Out of many possible candidates that avoid the usage of regular Quadratic Programming (QP) solvers, the two learning methods gaining attention lately are Iterative Single-Data Algorithms (ISDA) and Sequential Minimal Optimization (SMO) [3,5,7,9]. ISDAs operate on a single sample at a time (pattern-based learning) and iterate towards the best-fit solution. The Kernel AdaTron (KA) was the first ISDA for SVMs, using kernel functions to map data to the high-dimensional feature space of SVMs [2] and conducting AdaTron [1] processing in that space. Platt's SMO algorithm is an outlier among the so-called decomposition approaches introduced in [4,6], operating on a working set of two samples at a time. Because the solution for a two-point working set can be determined analytically, SMO does not require standard QP solvers. Due to this analytical character, SMO has been especially popular and is the most commonly utilized, analyzed, and further developed approach. Meanwhile, KA, though yielding comparable performance (accuracy and computation time) on classification problems, has not gained as much traction. The reason for this is twofold. First, until recently [8], KA appeared to be restricted to classification tasks and, second, it lacked the support of a robust theoretical framework. KA employs a gradient ascent procedure, a fact that may also have made some researchers wary of the difficulties gradient ascent techniques face in the presence of a possibly ill-conditioned kernel matrix. In [10], for the case of a missing bias term b, the authors derive and demonstrate the equivalence of two apparently dissimilar ISDAs, namely a KA approach and an unbiased variant of the SMO training scheme [9], when constructing SVMs with positive definite kernels. The equivalence applies to both classification and regression tasks and sheds additional insight on these apparently dissimilar learning methods.
Despite the richness of the toolbox set up to solve the quadratic programs arising from SVMs, and given the large amounts of data generated by social networks and by the medical and agricultural fields, among others, the amount of computer memory required by a QP solver for the SVM dual grows exponentially, and additional methods implementing different techniques and strategies are more than necessary.
Classical algorithms, namely ISDAs and SMO, do not distinguish between the different types of samples (noise, border, and core), which causes searches in unpromising areas. In this work, we introduce a hybrid method to overcome these shortcomings, namely the Optimal Recurrent Neural Network Density Based Support Vector Machine (Opt-RNN-DBSVM). This method proceeds in four steps: (a) characterization of the different samples based on the density of the data set (noise, core, and border); (b) elimination of samples with a low probability of being support vectors, namely core samples that are very far from the borders of the different components of the different classes; (c) construction of an appropriate recurrent neural network based on an original energy function that balances the components of the SVM dual (constraints and objective function) and ensures the feasibility of the network equilibrium points [3,4]; and (d) solution of the system of differential equations governing the dynamics of the RNN, using the Euler-Cauchy method with an optimal time step. Due to its recurrent nature, the RNN is able to memorize locations visited during previous explorations. On the one hand, two interesting fundamental results were demonstrated: the convergence of RNN-SVM to feasible solutions, and the fact that Opt-RNN-DBSVM has a very low time complexity compared to Const-RNN-SVM, SMO-SVM, ISDA-SVM, and L1QP-SVM.
On the other hand, several experimental studies were conducted on well-known data sets. Based on standard performance measures (accuracy, F1-score, precision, and recall), Opt-RNN-DBSVM outperformed the recurrent neural network SVM with constant time step, the Kernel AdaTron family of SVM algorithms, and well-known non-kernel models; in fact, Opt-RNN-DBSVM improved the accuracy, the F1-score, the precision, and the recall. Moreover, the proposed method requires a very small number of support vectors. The rest of this paper is organized as follows: in the second section, we give the flowchart of the proposed method; in the third section, we give the outline of our recent SVM version called Density Based Support Vector Machine; the fourth section presents, in detail, the construction of the recurrent neural network associated with the SVM dual and the Euler-Cauchy algorithm that implements an optimal time step; in the fifth section, we give some experimental results; finally, we give some conclusions and future extensions of Opt-RNN-DBSVM.
2. The architecture of the proposed method
The Kernel AdaTron (KA) algorithms, namely ISDAs and SMO, treat the different types of samples (noise, border, and core) in the same manner (all samples are considered over several iterations and assumed to be support-vector candidates with uniform probability), which causes searches in unpromising areas and increases the number of iterations. In this work, we introduce an economical method to overcome these shortcomings, namely the Optimal Recurrent Neural Network Density Based Support Vector Machine (Opt-RNN-DBSVM). This method proceeds in four steps (see Figure 1): (1) Characterization of the different samples based on the density of the data set (noise, core, and border); to this end, two parameters are introduced: the size of the neighborhood of the current sample and the threshold that permits such a categorization; (2) Elimination of samples with a low probability of being support vectors, namely the core samples that are very far from the borders of the different components of the different classes and the noise samples that contain wrong information about the phenomenon under study; in our previous work [28], we demonstrated that such a suppression does not influence the performance of the classifiers (steps (1) and (2) are sketched in the code below); (3) Construction of an appropriate recurrent neural network based on an original energy function that balances the SVM-dual components (constraints and objective function) and ensures the feasibility of the network equilibrium points [3,4]; (4) Solution of the system of differential equations governing the dynamics of the RNN, using the Euler-Cauchy method with an optimal time step. In this regard, the formula for the coming state of the neurons of the constructed RNN is introduced into the energy function, which leads to a one-dimensional quadratic optimization problem whose solution represents the optimal step of the Euler-Cauchy process, ensuring a maximum decrease of the energy function [37]. The components of the produced equilibrium point represent the membership degrees of the different samples to the support vector set.
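To make steps (1) and (2) concrete, the following minimal sketch performs a density-based categorization of the samples; the function name and the exact noise/border rule (a DBSCAN-like rule) are illustrative assumptions and not the reference implementation of [28].

```python
import numpy as np

def categorize_samples(X, r, mp):
    """Density-based categorization into noise, border, and core points.

    Illustrative rule (DBSCAN-like): a sample with at least mp neighbors in
    its r-neighborhood is a core point; a non-core sample with a core point
    in its r-neighborhood is a border point; the remaining samples are noise.
    """
    # Pairwise Euclidean distances: O(N^2), as in the complexity analysis.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= r
    counts = neighbors.sum(axis=1) - 1                   # exclude the sample itself
    is_core = counts >= mp
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    labels = np.where(is_core, "core", np.where(is_border, "border", "noise"))
    return labels

# Step (2): keep only the border samples for the (reduced) dual problem.
# labels = categorize_samples(X, r=0.5, mp=5)            # hypothetical parameter values
# X_reduced, y_reduced = X[labels == "border"], y[labels == "border"]
```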
3. Density based support vector machine
In the following, we denote by BD the set of N samples $(x_i, y_i)_{i=1,\dots,N}$, where each sample $x_i$ is labeled by $y_i$ and distributed over the classes; in our case, there are two classes and $y_i \in \{-1, +1\}$.
3.1. Classical support vector machine
The hyperplane we are looking for must satisfy the equation $w^{\top}x + b = 0$, where $w$ is the weight vector that defines this SVM separator and satisfies the family of constraints $y_i(w^{\top}x_i + b) \ge 1$, $i = 1,\dots,N$. To ensure a maximum margin, one needs to maximize $2/\lVert w\rVert$ (equivalently, minimize $\tfrac{1}{2}\lVert w\rVert^{2}$).
As the patterns are not linearly separable, we introduce kernel functions K (satisfying the Mercer conditions [9]) to embed the data in an appropriate space. By introducing the Lagrangian relaxation and writing the Karush-Kuhn-Tucker conditions, we obtain a quadratic optimization problem with a single linear constraint that we solve to determine the support vectors [18]. To address the problem of saturated constraints, some researchers have added the notion of a soft margin [8]. They employed N supplementary slack variables, one for each constraint. The sum of the slack variables is weighted and included in the cost function, where $\phi$ represents the transformation function derived from the kernel function K. We then obtain the corresponding dual problem.
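For reference, the standard soft-margin primal and its kernelized dual are recalled below in their textbook form; the notation ($\xi_i$, $C$, $\alpha_i$, $\phi$) is assumed to match the one used here.

```latex
% Soft-margin primal (standard form)
\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad\text{s.t.}\quad y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,N.

% Kernelized dual (the SVM-dual referred to throughout)
\max_{\alpha}\ \ \sum_{i=1}^{N}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j)
\quad\text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C.
```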
Several methods can be used to solve this optimization problem: gradient methods, the Frank-Wolfe linearization method, the column generation method, Newton's method applied to the Karush-Kuhn-Tucker system [23], sub-gradient methods, the Dantzig algorithm, the Uzawa algorithm [23], recurrent neural networks [22], hill climbing, simulated annealing, search by calf, A*, genetic algorithms, ant colony optimization [24], the particle swarm method [22], etc. Several versions of SVM have been proposed in the literature, among which we find the Least Squares Support Vector Machine classifiers (LS-SVM) [12], the Generalized Support Vector Machine (G-SVM) [10], the Fuzzy Support Vector Machine [2,11], the One-Class Support Vector Machine (OC-SVM) [13,18], the Total Support Vector Machine (T-SVM) [14], the Weighted Support Vector Machine (W-SVM) [15], Granular Support Vector Machines (G-SVM) [16], the Smooth Support Vector Machine (S-SVM) [17], Proximity Support Vector Machine classifiers (P-SVM) [19], Multisurface proximal support vector machine classification via generalized eigenvalues (GEP-SVM) [20], and Twin Support Vector Machines (T-SVM) [21], etc.
3.2. Density based support vector machine (DBSVM)
In this section, we give a short description of the DBSVM method. We introduce a real number r > 0 and an integer mp > 0, called min-points, and we define three types of samples: noise points, border points, and interior points (or cord points). We showed that the interior points do not change their nature even when they are projected into another space by the kernel functions. Furthermore, we have established that such points cannot be selected as support vectors [28].
Definition 1. Let $S \subset \mathbb{R}^n$. A point $x$ is said to be an interior point (or cord point) of S if there exists $\epsilon > 0$ such that $B(x, \epsilon) \subset S$. The set of all interior points of S forms the interior of S.
Definition 2. For a given dataset BD, a non-negative real r, and an integer mp, there exist three kinds of samples: noise points, cord points, and border points. The defining conditions are based on the number of samples of BD contained in the r-neighborhood of the current sample and on the presence of a cord point in that neighborhood.
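A plausible formalization, patterned on the standard density-based (DBSCAN-like) categorization and consistent with the roles of r and mp in Algorithm 1 (the exact inequalities used in [28] may differ), is:

```latex
\begin{aligned}
x \text{ is a cord point}   &\iff \bigl|B(x,r)\cap BD\bigr| \ \ge\ mp,\\
x \text{ is a border point} &\iff \bigl|B(x,r)\cap BD\bigr| \ <\ mp \ \text{ and } \ \exists\, c \text{ cord point such that } x \in B(c,r),\\
x \text{ is a noise point}  &\iff x \text{ is neither a cord point nor a border point.}
\end{aligned}
```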
Let K be a kernel function allowing one to move from the input space to the feature space using the transformation $\phi$.
Lemma 1. [28] If a sample $a$ is a cord point for a given $\epsilon$ and min-points (mp), then $\phi(a)$ is also a cord point, with an appropriate $\epsilon'$ and the same min-points (mp).
Theorem 1. [28] A cord point is either a noise-point or a border-point.
Proposition 1. [28] Let r be a real number. The cord point set is a decreasing function for the inclusion operator.
Let the set of the Lagrange multipliers be partitioned as $BM \cup CM \cup NM$, where BM, CM, and NM are the Lagrange multipliers of the border samples, cord samples, and noise samples, respectively. As the elements of CM and NM cannot be selected to be support vectors, the reduced dual problem (RD) is obtained by restricting the dual to the border-sample multipliers.
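A plausible explicit form of (RD), assuming the reduction simply fixes the cord and noise multipliers at zero (B denotes the index set of the border samples), is:

```latex
\max_{\alpha}\ \ \sum_{i\in B}\alpha_i \;-\; \frac{1}{2}\sum_{i\in B}\sum_{j\in B}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j)
\qquad\text{s.t.}\qquad \sum_{i\in B}\alpha_i\,y_i = 0,\qquad 0\ \le\ \alpha_i\ \le\ C,\ \ i\in B.
```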
In this work, as the (RD) problem is quadratic with a linear constraint, we use the continuous Hopfield network, for which we propose an original energy function in the coming section [26].
4. Recurrent neural network for optimal support vectors
Continuous Hopfield networks consist of interconnected neurons with a smooth sigmoid activation function (usually a hyperbolic tangent). The dynamics of the CHN are governed by a differential equation relating the neuron states to the outputs through the weight matrix and the biases (a standard form is recalled below), where $x$, $y$, $W$, and $I$ are, respectively, the vector of neuron states, the vector of outputs, the weight matrix, and the vector of biases. For a CHN of $n$ neurons, the state $x_i$ and the output $y_i$ of neuron $i$ are related through the activation function. For an initial state vector $x^0$, a vector $x^e$ is called an equilibrium point of the system if, and only if, the state no longer changes once it is reached. It should be noted that if an energy function (or Lyapunov function) exists, an equilibrium point exists as well. Hopfield proved that the symmetry of the weight matrix is a sufficient condition for the existence of the Lyapunov function [25].
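For reference, the classical Hopfield form of these dynamics is recalled below; the exact variant used here may differ slightly (for instance, optimization-oriented CHNs often drop the decay term $-x/\tau$):

```latex
\frac{dx}{dt} \;=\; -\,\frac{x}{\tau} \;+\; W\,y \;+\; I,
\qquad y_i \;=\; f(x_i),\ \ i=1,\dots,n,
```

with $f$ a smooth, bounded, increasing activation function.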
4.1. Continuous Hopfield network based on the original energy function
To solve the obtained dual problem via a recurrent neural network [26], we propose an original energy function that balances the objective of the reduced dual and a penalty on its constraints through Lagrangian parameters. To determine the vector of the neuron biases, we calculate the first partial derivatives of E; the components of the bias vector are obtained from these derivatives.
To determine the connection weights between each pair of neurons, we calculate the second partial derivatives of E; the components of the weight matrix are obtained from these second derivatives (an illustrative construction is sketched below).
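As an illustration of how such a construction might look in practice, the sketch below assumes a two-parameter energy $E(\alpha) = \varphi_1\bigl(\tfrac{1}{2}\alpha^{\top}Q\alpha - \mathbf{1}^{\top}\alpha\bigr) + \tfrac{\varphi_2}{2}(y^{\top}\alpha)^{2}$ with $Q_{ij} = y_i y_j K(x_i, x_j)$ and an RBF kernel; this is a plausible stand-in, not the authors' exact energy function (which involves three Lagrangian parameters).

```python
import numpy as np

def build_rnn_weights(X, y, phi1=1.0, phi2=10.0, gamma=0.5):
    """Hypothetical construction of the CHN weights and biases.

    Assumed energy (a stand-in, not the manuscript's exact formula):
        E(a) = phi1 * (1/2 a^T Q a - 1^T a) + phi2/2 * (y^T a)^2,
    with Q_ij = y_i y_j K(x_i, x_j). Since dE/da = phi1*(Q a - 1) + phi2*y*(y^T a),
    the CHN update direction -dE/da equals W a + I with
        W = -(phi1*Q + phi2*y y^T)   and   I = phi1 * 1.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # squared distances
    K = np.exp(-gamma * sq)                                     # RBF kernel (illustrative choice)
    Q = (y[:, None] * y[None, :]) * K
    W = -(phi1 * Q + phi2 * np.outer(y, y))                     # symmetric connection weights
    I = phi1 * np.ones(X.shape[0])                              # neuron biases
    return W, I
```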
To calculate the equilibrium point of the proposed recurrent neural network, we use the Euler-Cauchy iterative method:
- (1)
Initialization: the initial state and the step are randomly chosen;
- (2)
Given the current state and the step, the step is chosen such that the decrease of the energy function is maximum; we calculate the next state using the Euler update, then we calculate the outputs using the activation function, and finally we apply the projection operator P onto the feasible set (a sketch of this iteration is given after the list);
- (3)
Return to (2) until convergence, i.e., until the change between two successive states falls below a given tolerance.
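A minimal sketch of this iteration is given below, under simplifying assumptions: the energy is taken in the generic quadratic form $E(y) = -\frac{1}{2}y^{\top}Wy - I^{\top}y$, the update direction is $d = Wy + I$, and the projection P is assumed to clip the outputs to $[0, C]$; the parameters n_iter, tol, and alpha_max are hypothetical.

```python
import numpy as np

def euler_cauchy_optimal_step(W, I, C, n_iter=500, tol=1e-6, alpha_max=1.0):
    """Euler-Cauchy iteration with an optimal step for a quadratic CHN energy.

    Illustrative sketch (not the authors' exact scheme): the energy is
    E(y) = -1/2 y^T W y - I^T y, the update direction is d = W y + I, and
    the projection P clips the outputs to [0, C]. Along d, the energy is a
    one-dimensional quadratic whose minimizer gives the optimal step.
    """
    n = W.shape[0]
    y = np.random.uniform(0.0, C, size=n)            # (1) random initialization
    for _ in range(n_iter):
        d = W @ y + I                                # update direction (-dE/dy)
        a = -0.5 * d @ W @ d                         # quadratic coefficient of E along d
        b = -d @ d                                   # linear coefficient (always <= 0)
        if a > 0:                                    # upward parabola: interior minimum
            alpha = min(-b / (2 * a), alpha_max)
        else:                                        # downward/degenerate: largest allowed step
            alpha = alpha_max
        y_new = np.clip(y + alpha * d, 0.0, C)       # (2) Euler step followed by projection P
        if np.linalg.norm(y_new - y) < tol:          # (3) stop when the state no longer changes
            return y_new
        y = y_new
    return y
```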
Figure 2 shows the connection weights W between each pair of neurons.
Theorem 2. If , else , where
Proof of Theorem 2. We have and , then because
Thus
Finally, for , and for .
Concerning the satisfaction of the constraint family, we use an activation function involving a parameter that is supposed to be a very large positive real number, which ensures that the outputs satisfy the constraints.
We consider a kernel function such that .
Theorem 3. A continuous Hopfield network has an equilibrium point if … and … .
Theorem 4. If … , then CHN-SVM has an equilibrium point.
Proof of Theorem 4. We have … ; then … , because K is symmetric. On the other hand, …
Then CHN-SVM has an equilibrium point.
4.2. Continuous Hopfield network with optimal time step
In this section, we choose, mathematically, the optimal time-step size at each iteration of the Euler-Cauchy method used to solve the dynamical equation of the recurrent neural network proposed in this paper. At the end of iteration $k$, the state $x^k$ is known; the next step size, which permits us to calculate $x^{k+1}$ through the Euler formula, must be chosen such that the decrease of the energy function is maximal. Given the activation function of the proposed neural network, the energy function can be written in matrix form in terms of the weight matrix W and the bias vector I.
At the $k$-th iteration, the state $x^k$ is known; $x^{k+1}$ is then calculated by the Euler update, where $\alpha_k$ is the current time step that must be optimal. To this end, we substitute $x^{k+1} = x^k + \alpha_k\,\frac{dx^k}{dt}$ into $E$, which yields a one-dimensional quadratic function of the step, $E_k(\alpha_k) = a\,\alpha_k^2 + b\,\alpha_k + c$, where the coefficients $a$, $b$, and $c$ depend only on the current state, the weight matrix, and the biases. Thus, the best time step is the minimizer of $E_k$. Figure 3a–c illustrates the different cases.
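Assuming $E_k$ is indeed the quadratic above, the optimal step in each of the cases of Figure 3 can be written in closed form; $\alpha_{\max}$ denotes an assumed upper bound on the admissible step (not a quantity taken from the manuscript):

```latex
\alpha_k^{*} \;=\;
\begin{cases}
\min\!\bigl(\max\!\bigl(-\tfrac{b}{2a},\,0\bigr),\,\alpha_{\max}\bigr), & a > 0 \quad\text{(upward parabola, interior minimum)},\\[1mm]
\arg\min\limits_{\alpha\,\in\,\{0,\;\alpha_{\max}\}} E_k(\alpha), & a < 0 \quad\text{(downward parabola, boundary minimum)},\\[1mm]
\alpha_{\max}\ \text{if } b<0,\ \ 0\ \text{otherwise}, & a = 0 \quad\text{(linear case)}.
\end{cases}
```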
![Preprints 79798 i001](https://www.preprints.org/frontend/picture/ms_xml/manuscript/d64c047dc990b4d8d7d45854be6daf6d/preprints-79798-i001.png)
4.3. Opt-RNN-DBSVM algorithm
In this section, the procedures described in Subsections 3.2, 4.1, and 4.2 are summarized in Algorithm 1.
Algorithm 1. Opt-RNN-DBSVM.
![Preprints 79798 i002](https://www.preprints.org/frontend/picture/ms_xml/manuscript/d64c047dc990b4d8d7d45854be6daf6d/preprints-79798-i002.png)
The inputs of Algorithm 1 are the radius r (the size of the neighborhood of the current sample), the minimum number of samples mp in B(Current_sample, r) (which determines the type of this sample), the three Lagrangian parameters (which make a compromise between the dual components), the bound C of the SVM [28], and the number of iterations (which represents an artificial convergence criterion).
Algorithm 1 proceeds in three macro-steps: data preprocessing, RNN-SVM construction, and RNN-SVM equilibrium point estimation. The input of the first phase is the initial data set of labeled samples. Based on r and mp, the algorithm determines the type of each sample from the number of samples in its neighborhood. The output of this phase is a reduced sub-data set (the initial data set minus the core samples).
The inputs of the second phase are the reduced data set, the Lagrangian parameters, and the SVM bound C. Based on the energy function built in Subsection 4.1 and on its first and second derivatives, the architecture of CHN-SVM is built; the biases and connection weights, which represent the output of this phase, are calculated. These, in turn, represent the input of the third phase, where the Euler-Cauchy algorithm is used to calculate the degree of membership of the different samples to the set of support vectors; to ensure an optimal decrease of the energy function at each iteration, an optimal step is determined by solving a one-dimensional quadratic optimization problem (see Subsection 4.2). At convergence, the proposed algorithm produces the support vectors, based on which Opt-RNN-DBSVM can predict the class of unseen samples.
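Putting the three macro-steps together, Algorithm 1 can be sketched as follows; this reuses the illustrative helpers categorize_samples, build_rnn_weights, and euler_cauchy_optimal_step defined above and is a schematic reading of the algorithm, not its reference implementation.

```python
def opt_rnn_dbsvm_fit(X, y, r, mp, C, lagrange_params):
    """Schematic view of the three macro-steps of Algorithm 1 (illustrative)."""
    # Phase 1: density-based preprocessing -- keep only the border samples.
    kinds = categorize_samples(X, r, mp)
    keep = kinds == "border"
    Xr, yr = X[keep], y[keep]
    # Phase 2: build the CHN-SVM (weights W and biases I) from the energy
    # function of Subsection 4.1 (illustrative construction above).
    W, I = build_rnn_weights(Xr, yr, **lagrange_params)
    # Phase 3: equilibrium point via Euler-Cauchy with an optimal step; the
    # components of the equilibrium are the membership degrees (multipliers).
    alpha = euler_cauchy_optimal_step(W, I, C)
    support = alpha > 1e-8
    return Xr[support], yr[support], alpha[support]

# Hypothetical usage:
# sv_X, sv_y, sv_alpha = opt_rnn_dbsvm_fit(X, y, r=0.5, mp=5, C=1.0,
#                                          lagrange_params={"phi1": 1.0, "phi2": 10.0})
```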
Proposition 2. If N, r, and ITER represent, respectively, the size of the labeled data set BD, the number of remaining samples (output of the preprocessing phase), and the number of iterations, then the complexity of Algorithm 1 is $O(N^2 + ITER \cdot r^2)$.
Proof of Proposition 2. First, in the preprocessing phase, we calculate, for each sample, its distances to the other samples and we execute N comparisons to determine the type of each sample; thus, the complexity of this phase is $O(N^2)$. Second, during the ITER iterations, we update the activation of each neuron using the activations of all the other neurons to solve the reduced SVM-dual; thus, the third phase has a complexity of $O(ITER \cdot r^2)$.
Finally, the complexity of Algorithm 1 is $O(N^2 + ITER \cdot r^2)$. Let us denote by Const-RNN-SVM the SVM version that implements a recurrent neural network based on a constant time step. Following the same reasoning, the complexity of Const-RNN-SVM is of $O(ITER \cdot N^2)$.
Notes: - As the Kernel AdaTron algorithm (KA) is the kernel version of SMO and ISDA, and KA implements two embedded N-loops in each iteration, the complexity of SMO and ISDA is of $O(ITER \cdot N^2)$ [40]. In addition, we consider L1QP-SVM, which implements the numerical linear-algebra Gauss-Seidel method [49] with two embedded N-loops in each iteration; its complexity is therefore also of $O(ITER \cdot N^2)$.
- For a labeled data set of very large size and high density, we have $r \ll N$; thus $ITER \cdot r^2 \ll ITER \cdot N^2$ and $N^2 + ITER \cdot r^2 \ll ITER \cdot N^2$. Consequently, the complexity of Opt-RNN-DBSVM is much lower than the complexities of Const-RNN-SVM, SMO-SVM, ISDA-SVM, and L1QP-SVM.
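As a purely illustrative order-of-magnitude check (the values of N, r, and ITER below are assumptions, not figures taken from the experiments):

```latex
N = 10^{4},\quad r = 10^{3},\quad ITER = 10^{3}
\ \Longrightarrow\
N^{2} + ITER\cdot r^{2} \,=\, 10^{8} + 10^{9} \,\approx\, 1.1\times 10^{9}
\quad\text{versus}\quad
ITER\cdot N^{2} \,=\, 10^{11},
```

i.e., roughly a hundred-fold reduction in the dominant cost under these assumptions.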
5. Experimentation
In this section, we compare Opt-RNN-DBSVM to several classifiers: Const-RNN-SVM (SVM based on a recurrent neural network using a constant Euler-Cauchy time step), SMO-SVM, ISDA-SVM, L1QP-SVM, and some non-kernel classifiers (for example MLP, NB, KNN, Decision Tree, etc.). The classifiers were tested on several data sets: iris, abalone, wine, ecoli, balance, liver, spect, seed, and PIMA (collected from the University of California at Irvine (UCI) repository [29]). The performance measures used in this study are accuracy, F1-score, precision, and recall.
5.1. Opt-RNN-DBSVM vs Const-CHN-SVM
In this section, we compare Opt-RNN-DBSVM to Const-RNN-SVM by considering different values of the Euler-Cauchy time step. Table 1 and Table 2 give the values of accuracy, F1-score, precision, and recall on the considered data sets. The results show the superiority of Opt-RNN-DBSVM over Const-CHN-SVM; this superiority is quantified over the set DATA of the considered data sets. These results are not surprising, because Opt-RNN-DBSVM ensures an optimal decrease of the CHN energy function at each step.
Figure 4 and Figures A1–A5 give the series of optimal steps generated by Opt-RNN-DBSVM during the iterations for the different data sets. We remark that all the optimal steps are taken from the interval [3, 4], which furnishes an optimal domain for those using a CHN based on a constant step.
5.2. Opt-RNN-DBSVM vs Classical-Optimizer-SVM
In this section, we give the performance of different Classical-Optimizer-SVMs (L1QP-SVM, ISDA-SVM, and SMO-SVM) applied to several datasets, and we compare the number of support vectors obtained by the different Classical-Optimizer-SVMs with that obtained by Opt-RNN-DBSVM.
Table 3 gives the values of the accuracy, F1-score, precision, and recall measures of the Classical-Optimizer-SVMs on the different datasets. The results show the superiority of Opt-RNN-DBSVM.
Figure 5, Figure 6, Figure 7 and Figure 8 illustrate, respectively, the support vectors obtained using ISDA-SVM, L1QP-SVM, SMO-SVM, and Opt-RNN-DBSVM applied to the IRIS data. We note that (a) ISDA considers more than 96% of the samples as support vectors, which is really an exaggeration, (b) L1QP and SMO use a reasonable number of samples as support vectors, but most of them are duplicated, and (c) thanks to the preprocessing, Opt-RNN-DBSVM reduces the number of support vectors by more than 32% compared to SMO and L1QP, which allows it to overcome the over-fitting phenomenon encountered with SMO and L1QP.
Figure 8 gives the support vectors obtained by Opt-RNN-DBSVM applied to IRIS data.
5.3. Opt-RNN-DBSVM vs non-kernel classifiers
In this section, we compare Opt-RNN-DBSVM to several non-kernel classifiers, namely Naive Bayes [30], MLP [31], KNN [32], AdaBoostM1 [33], Decision Tree [34], SGDClassifier [35], the Nearest Centroid Classifier [35], and the classical SVM [1].
Table 4 gives the values of the accuracy, F1-score, precision, and recall measures for the considered data sets. The best performance is reached by Opt-RNN-DBSVM; the classical SVM, Decision Tree, and AdaBoost methods come next and are close to the Opt-RNN-DBSVM method.
Additional comparison studies were performed on the PIMA and German diabetes data sets, and the ROC curves were used to calculate the AUC for the best performance obtained from each non-kernel classifier.
Figure A6 and
Figure A7 show the comparison of the ROC curves of the classifiers DT, KNN, MLP, NB, and Opt-RNN-DBSVM method, evaluated on the PIMA data set. We point out that Opt-RNN-DBSVM quickly converges to the best results and obtains more true positives for a small number of false positives compared to several classification methods. More comparisons are given in
Appendix A;
Figure A8 and
Figure A9 show the comparison of the ROC curves of the classical SVM and the Opt-RNN-DBSVM method, evaluated on the German diabetes data set.
6. Conclusions
The main challenges of SVM implementation are that the number of local minima and the amount of computer memory required for solving the SVM-dual increase exponentially with the size of the data set. The Kernel AdaTron family of algorithms, ISDA and SMO, has handled very large classification and regression problems. However, these methods treat noise, border, and core samples in the same way, resulting in a blind search in unpromising areas. In this paper, we have introduced a hybrid approach to deal with these drawbacks, namely the Optimal Recurrent Neural Network Density Based Support Vector Machine (Opt-RNN-DBSVM), which performs in four phases: characterization of the different samples, elimination of the samples having a weak probability of being support vectors, building an appropriate recurrent neural network based on an original energy function, and solving the differential equation system governing the RNN dynamics using the Euler-Cauchy method with an optimal time step. Due to its recurrent nature, the RNN is able to memorize locations visited during previous explorations. On one hand, two interesting fundamental results were demonstrated: the convergence of RNN-SVM to feasible solutions, and the fact that Opt-RNN-DBSVM has a very low time complexity compared to Const-RNN-SVM, SMO-SVM, ISDA-SVM, and L1QP-SVM. On the other hand, several experimental studies were conducted on well-known data sets (iris, abalone, wine, ecoli, balance, liver, spect, seed, pima). Based on credible performance measures (accuracy, F1-score, precision, and recall), Opt-RNN-DBSVM outperformed Const-RNN-SVM, the KA-SVM family, and some non-kernel models (cited in Table 4). In fact, Opt-RNN-DBSVM improved accuracy by up to 3.43%, the F1-score by up to 2.31%, precision by up to 7.52%, and recall by up to 6.5%. In addition, compared to SMO-SVM, ISDA-SVM, and L1QP-SVM, Opt-RNN-DBSVM achieves a reduction of the number of support vectors by up to 32%, which permits saving memory for huge applications that implement several machine learning models. The main problem encountered in the implementation of Opt-RNN-DBSVM is the determination of the Lagrangian parameters involved in the SVM energy function. In this sense, a genetic strategy will be introduced to determine these parameters for each data set.