2. Related Work
Numerous studies have applied deep learning to intrusion detection in big data environments. DeepIDS, a deep learning based IDS framework for big data settings, was introduced in [8]. It used deep belief networks in conjunction with stacked denoising autoencoders to extract high-level features from network traffic data. The extracted features were then fed into a support vector machine classifier to detect intrusions. Compared with conventional approaches, DeepIDS showed better detection accuracy and coped with the complexity of big data settings. A hybrid deep learning model integrating CNNs and GRUs for intrusion detection in big data systems was introduced in [10]. The CNN component extracted spatial patterns from network traffic data, whereas the GRU component modeled temporal dependencies. The model achieved high detection rates and surpassed conventional machine learning techniques in accuracy and efficiency. The authors of [11] developed a Deep-NN anomaly detection (ADT) technique for intrusion detection in big data environments. The technique learns representations of normal network traffic through unsupervised learning with Restricted Boltzmann Machines (RBMs). Anomalies were detected by evaluating the reconstruction error of the RBMs. The method successfully detected anomalies in big data systems.
Figure 1.
Deep Learning based IDS System.
The use of Generative Adversarial Networks (GANs) for intrusion detection in big data environments was proposed in [12]. The GANIDS framework consisted of a generator network that learned the distribution of normal network traffic data and a discriminator network that distinguished normal from abnormal traffic. GANIDS showed promising results in detecting both known and unknown threats and could adapt to changing attack patterns. DeepLog, presented in [13], is an approach to anomaly detection and diagnosis from system logs, although it is not specifically targeted at big data environments. DeepLog used LSTM networks to capture sequential dependencies in log data and identify anomalous patterns. The method showed high accuracy in spotting abnormal behavior and gave helpful information for system diagnosis. A hybrid solution combining the DBSCAN algorithm and DNNs for intrusion detection in big data systems was proposed in [14]. DBSCAN was used to find clusters in network traffic data, and the resulting clusters were then fed into the DNN model for classification. The DBSCAN-DNN technique showed good detection rates and effectively addressed the difficulties of big data environments, including the high dimensionality and variability of the data. A deep learning architecture for intrusion detection in big data environments that combines LSTM networks and CNNs was presented in [15]. The CNN component extracted spatial patterns, whereas the LSTM component captured temporal dependencies in the network traffic data. The hybrid LSTM-CNN model improved detection accuracy and outperformed conventional machine learning techniques in effectiveness and scalability.
An autoencoder-based approach to anomaly detection in intrusion detection systems for big data systems was presented in [16]. Autoencoders were employed as unsupervised learning models to reconstruct normal network traffic data. Anomalies were found by measuring the reconstruction error between the input and the reconstructed data. The method successfully identified previously unknown attacks and proved resilient to changes in network traffic data. The authors of [17] proposed an attention-based deep learning IDS model for big data environments. The model used self-attention to weigh the significance of different network traffic features, which allowed it to concentrate on the information that matters for intrusion detection. The attention-based model demonstrated improved detection accuracy and robustness in the presence of noisy or redundant features. A Gated Attention Network (GAN) for intrusion detection in big data systems was presented in [18]. The model uses attention to dynamically assign weights to distinct features of network traffic data, and the gating mechanism enables it to attend selectively to the input that is relevant for intrusion detection. In big data contexts, this technique exhibited higher detection performance and adaptability to changing attack patterns. These methods demonstrate how deep learning techniques [19], such as hybrid architectures, attention mechanisms, and unsupervised learning models, are increasingly used for intrusion detection in big data settings. They offer useful solutions to the problems caused by the size, complexity, and variability of the data in such systems. Further study in this field is essential to improve the precision, effectiveness, and scalability of intrusion detection systems in big data environments.
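The reconstruction-error rule used by the autoencoder-based approach can be sketched as follows. This is a minimal illustration and not the cited authors' implementation: the function names, the mean-plus-three-standard-deviations threshold, and the sample error values are assumptions, and the reconstructions are taken as given rather than produced by a trained autoencoder.

```python
import statistics


def reconstruction_error(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def fit_threshold(normal_errors, n_std=3.0):
    """Assumed rule: mean + n_std * standard deviation of errors on normal traffic."""
    return statistics.mean(normal_errors) + n_std * statistics.pstdev(normal_errors)


def is_anomaly(x, x_hat, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold


# Hypothetical reconstruction errors measured on normal traffic.
normal_errors = [0.010, 0.012, 0.009, 0.011, 0.008]
threshold = fit_threshold(normal_errors)

# A poorly reconstructed sample is flagged; a well reconstructed one is not.
flag_bad = is_anomaly([1.0, 0.0], [0.5, 0.0], threshold)
flag_good = is_anomaly([1.0, 0.0], [0.95, 0.0], threshold)
```

The choice of threshold controls the trade-off between missed attacks and false alarms, which is why this kind of method is sensitive to it.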
The use of deep reinforcement learning (DRL) for intrusion detection in big data systems was examined in [20]. The proposed method used a deep Q-network (DQN) to learn the best policy for making intrusion detection decisions on network traffic. The DRL-based technique showed encouraging results in identifying complex attacks and potential for adaptive, dynamic intrusion detection in changing big data settings. The use of Graph Neural Networks (GNNs) for intrusion detection in big data networks was introduced in [21]. GNNs can model the intricate interactions and relationships between network components. The method used GNNs to capture the graph structure of network traffic data and to learn representations that reflect attack patterns. The GNN-based method demonstrated better detection accuracy and the ability to handle large, dynamic big data networks. A hybrid deep learning model that combined CNNs and LSTM networks with transfer learning was proposed in [22] for intrusion detection in big data settings. Transfer learning was used to take models that had already been trained on large datasets and adapt them to the intrusion detection task. The hybrid model's improved detection capability and shorter training time make it appropriate for real-time intrusion detection in big data systems. The study in [23] applied multi-objective evolutionary algorithms to optimize intrusion detection in big data settings. The strategy aimed to optimize several goals simultaneously, including computational efficiency, false positive rate, and detection accuracy. Using a Pareto-based methodology, the optimization system balanced the trade-offs in intrusion detection performance and gave decision-makers a set of Pareto-optimal solutions. The studies in [24,25] investigated federated learning for intrusion detection in big data systems while protecting user privacy. Federated learning makes it possible to train models jointly on numerous distributed data sources without disclosing private information. The proposed strategy trained intrusion detection models on local data from several sources, which provided reliable intrusion detection in a distributed big data environment while preserving data privacy. Together, these methods demonstrate the suitability of deep learning for detecting intrusions in big data environments and the variety of approaches used to improve it. Researchers continue to investigate novel methodologies to address the particular difficulties of intrusion detection in big data systems.
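The Pareto-based selection mentioned above for multi-objective optimization can be illustrated with a short sketch. It only shows how non-dominated configurations are picked from a candidate set; the objective tuple (error rate, false positive rate, computational cost, all to be minimized) and the candidate values are hypothetical, and no evolutionary search is performed here.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


# Hypothetical IDS configurations: (error rate, false positive rate, cost).
config_a = (0.02, 0.05, 10.0)
config_b = (0.03, 0.02, 8.0)
config_c = (0.04, 0.06, 12.0)  # worse than config_a on every objective

front = pareto_front([config_a, config_b, config_c])
```

Here `config_a` and `config_b` both remain on the front because neither is better on every objective, which is the set of trade-offs a decision-maker would choose from.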
Table 1.
Analysis of Existing IDS Systems.
Approach | Key Features | Advantages | Limitations | Dataset Used
DeepIDS [6] | Stacked autoencoders, SVM classifier | Improved detection accuracy | High computational complexity | NSL-KDD
CNN-GRU [7] | Hybrid CNN and GRU architecture | High detection rates, efficient | High training time | CICIDS2017
DNN-AD [8] | RBMs for unsupervised feature learning | Effective detection of anomalies | Sensitivity to hyperparameters | UNSW-NB15
GANIDS [9] | Generative Adversarial Networks (GANs) | Detects both known and unknown attacks | Difficulty in training GANs | CICIDS2017
DBSCAN-DNN [10] | DBSCAN clustering, DNN classification | High detection rates, handles variability | Difficulty in determining DBSCAN's eps | UNSW-NB15
LSTM-CNN [11] | LSTM and CNN hybrid architecture | Improved accuracy and efficiency | Difficulty in capturing long dependencies | NSL-KDD
Autoencoder-Based [12] | Autoencoder reconstruction for anomaly | Effective detection of unknown attacks | Sensitive to selection of reconstruction error threshold | UNSW-NB15
5. Proposed IDS Systems
a. CNN-based Intrusion Detection System (IDS)
The operations performed by a CNN-based Intrusion Detection System (IDS) can be expressed as a mathematical model of its layers, as follows.
Convolution: Each filter slides over the input and computes a weighted sum of the values it covers, plus a bias:
$$O(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, F(m, n) + b$$
where O(i, j) represents the output value at position (i, j), I(i + m, j + n) represents the input value at position (i + m, j + n), F(m, n) represents the filter coefficient at position (m, n), and b represents the bias term.
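The convolution defined by these symbols (output O, input I, filter F, bias b) can be sketched in plain Python. The sketch assumes a single channel, stride 1, and no padding; the function name and the small input and filter are hypothetical and chosen only for illustration.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Compute O(i, j) = sum_m sum_n I(i + m, j + n) * F(m, n) + b.

    Assumes one channel, stride 1, and no padding.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 3x3 input and 2x2 filter.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel, bias=0.5)
```

For this input every output position evaluates to -3.5, because the filter subtracts the lower-right neighbor (always larger by 4) from the upper-left value and then adds the bias.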
Activation Function: After the convolution operation, an activation function is applied element-wise to introduce non-linearity into the network. A common activation function in CNNs is the Rectified Linear Unit (ReLU), which can be written as:
$$f(x) = \max(0, x)$$
where x represents the input value.
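Pooling: A pooling layer downsamples the feature map by summarizing each window of values. Assuming max pooling, which is the most common choice, the operation can be written as:
$$O(i, j) = \max_{(m, n) \in R_k(i, j)} I(m, n)$$
where O(i, j) represents the output value at position (i, j), I(m, n) represents the input value at position (m, n), k is the size of the pooling window, and R_k(i, j) denotes the k × k window of input positions associated with output position (i, j).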
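The activation and pooling steps can be sketched together. This is an illustration under stated assumptions: the pooling is taken to be max pooling with non-overlapping k × k windows (stride k), and the function names and the 4 × 4 feature map are hypothetical.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def max_pool2d(feature_map, k):
    """Max pooling over non-overlapping k x k windows (stride k is an assumption)."""
    pooled = []
    for i in range(0, len(feature_map) - k + 1, k):
        row = []
        for j in range(0, len(feature_map[0]) - k + 1, k):
            window = [feature_map[m][n]
                      for m in range(i, i + k)
                      for n in range(j, j + k)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map produced by a convolution layer.
fm = [[-1.0, 2.0, 0.5, -3.0],
      [4.0, -2.0, 1.5, 0.0],
      [0.0, 1.0, -1.0, -2.0],
      [3.0, -4.0, 2.5, 6.0]]
activated = [[relu(v) for v in row] for row in fm]
pooled = max_pool2d(activated, 2)
```

With k = 2 the 4 × 4 map is reduced to 2 × 2, keeping the largest activation in each window.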
Fully Connected Layer: Consider a fully connected layer with input vector x, weight matrix W, and bias vector b. The layer can be represented as:
$$y = f(Wx + b)$$
where y represents the output vector and f() is the activation function applied element-wise.
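A fully connected layer with input x, weights W, bias b, and element-wise activation f() can be sketched as below. The function names and the numeric values are hypothetical, and ReLU is used for f() only as an example.

```python
def relu(v):
    """Example activation f(); any element-wise function could be used."""
    return max(0.0, v)


def dense(x, W, b, activation):
    """Compute y = f(Wx + b), with W stored as a list of rows."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


# Hypothetical layer with 2 inputs and 2 outputs.
x = [2.0, 3.0]
W = [[1.0, -1.0],
     [0.5, 2.0]]
b = [0.0, -1.0]
y = dense(x, W, b, relu)
```

The first output is clipped to 0 by ReLU (its pre-activation is -1), while the second passes through unchanged as 6.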
These equations represent the core operations of a CNN-based IDS. The exact model depends on the architecture: it may include additional layers, skip connections, and regularization, and it also incorporates the optimization procedure used during training. The complete model is the composition of these basic operations together with the chosen activation functions, regularization methods, and optimization algorithm, and its formulation can be customized to the architecture, hyperparameters, and objectives of the IDS. The model given here is a broad overview of the required computations; an actual implementation may need additional techniques to improve the performance and accuracy of the IDS.
b. LSTM-based Intrusion Detection System (IDS)
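The LSTM cell combines several mathematical operations that allow it to capture long-term dependencies and to store and retrieve information over time. The operations include:
Cell State: The cell state serves as the LSTM's memory and is updated from the input gate, the forget gate, and the previous cell state. In the standard LSTM formulation, the update is:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where c_t is the cell state at time step t, c_{t-1} is the previous cell state, f_t and i_t are the forget-gate and input-gate activations, \tilde{c}_t is the candidate cell state, and \odot denotes element-wise multiplication.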
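One time step of an LSTM cell can be sketched in plain Python. The sketch follows the standard LSTM formulation (forget, input, and output gates plus a candidate state), which is assumed here because the source names only the gates and the cell state; the parameter names `W_f`, `W_i`, `W_o`, `W_c` and their biases are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names to weight matrices and bias vectors."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(p["W_f"], z, p["b_f"])]    # forget gate
    i = [sigmoid(v) for v in affine(p["W_i"], z, p["b_i"])]    # input gate
    o = [sigmoid(v) for v in affine(p["W_o"], z, p["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(p["W_c"], z, p["b_c"])]  # candidate state
    # Cell state: keep part of the old memory and add part of the candidate.
    c_t = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Hidden state: gated, squashed view of the cell state.
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t


# Sanity check with all-zero parameters (hidden size 1, input size 2).
zeros = {k: [[0.0, 0.0, 0.0]] for k in ("W_f", "W_i", "W_o", "W_c")}
zeros.update({k: [0.0] for k in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step([0.3, -0.7], [0.0], [2.0], zeros)
```

With zero parameters every gate equals 0.5 and the candidate state is 0, so the cell state is simply halved from 2.0 to 1.0, which makes the role of the forget gate easy to see.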
LSTM Layer: In an LSTM-based IDS, multiple LSTM cells are typically stacked together to form an LSTM layer. The output of each LSTM cell serves as the input to the next LSTM cell in the sequence. The mathematical operations described above are applied sequentially for each LSTM cell in the layer.
Fully Connected Layers: Following the LSTM layer, fully connected layers can be added to further process the output of the LSTM layer and perform classification or detection tasks. The computations in these layers are the same as those of the fully connected layer described for the CNN-based IDS above.
Output Layer: The output layer of an LSTM-based IDS performs the final classification or detection. The number of output neurons depends on the task and on the number of classes the IDS distinguishes. For binary classification, a single neuron with a sigmoid activation is sufficient. For multi-class classification, the output layer uses a softmax activation and has as many neurons as there are classes.
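The two output-layer choices can be sketched as follows. The logit values and the three-class setup are hypothetical; the max-subtraction in `softmax` is a common numerical-stability measure and does not change the result.

```python
import math


def sigmoid(z):
    """Binary output: probability of the positive (attack) class."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Multi-class output: one probability per class, summing to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Binary case: a single logit from the last layer.
p_attack = sigmoid(0.0)

# Multi-class case: hypothetical logits for three traffic classes.
class_probs = softmax([2.0, 1.0, 0.1])
predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
```

The predicted class is the index with the highest probability, here the first class.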
Loss Function and Optimization: During training, a loss function measures the difference between the predicted output and the ground-truth labels. Binary cross-entropy is used for binary classification and categorical cross-entropy for multi-class classification. Weights and biases are optimized using stochastic gradient descent (SGD) or one of its variants. Backpropagation through time (BPTT) computes the gradients of the loss function with respect to the parameters, which are then updated to minimize the loss.
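The loss functions and the basic SGD update can be sketched as below. The function names, the clipping constant `eps`, and the learning rate are assumptions made for illustration; computing the gradients themselves (via BPTT) is outside the scope of this sketch.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy for one-hot labels and probability rows."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)


def sgd_step(params, grads, lr=0.01):
    """Plain SGD update: w <- w - lr * gradient."""
    return [w - lr * g for w, g in zip(params, grads)]


# Hypothetical labels and predictions.
bce = binary_cross_entropy([1, 0], [0.5, 0.5])
cce = categorical_cross_entropy([[0, 1, 0]], [[0.25, 0.5, 0.25]])
updated = sgd_step([1.0], [2.0], lr=0.5)
```

Both example losses equal ln 2 (about 0.693), since in each case the model assigns probability 0.5 to the true class.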
The model described above is a general overview of the computations involved in an LSTM-based IDS. An actual implementation may involve additional architectural variations, regularization techniques, and hyperparameter tuning to improve performance.
The specific formulations can be customized to the requirements, dataset characteristics, and objectives of the IDS. Experimentation and fine-tuning are usually necessary to achieve accurate intrusion detection.
c. GAN-based Intrusion Detection System (IDS)
Generator Network: The generator network in a GAN-based IDS aims to generate synthetic network traffic data that closely resembles real network traffic. It takes random noise as input and generates synthetic samples. The mathematical model for the generator network involves a series of fully connected layers or convolutional layers, followed by activation functions (such as ReLU) and possibly normalization layers (such as batch normalization).
Consider a simple fully connected generator network. Given an input noise vector z, the generator network can be represented as:
$$G(z) = f(W_g z + b_g)$$
where G(z) represents the generated synthetic sample, f() is the activation function, W_g represents the weight matrix, and b_g represents the bias vector.
Discriminator Network: The discriminator receives a sample and estimates whether it is real or synthetic. Consider a simple fully connected discriminator network, similar to the generator network. Given an input sample x (either real or synthetic), the discriminator network can be represented as:
$$D(x) = f(W_d x + b_d)$$
where D(x) is the discriminator's output, f() is its activation function, W_d is its weight matrix, and b_d is its bias vector.
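A forward pass through single-layer versions of the generator and discriminator can be sketched as follows. The source leaves the activation f() unspecified, so tanh for the generator and sigmoid for the discriminator are assumptions, as are the weights, the noise vector, and the three-feature sample size; no adversarial training is shown.

```python
import math


def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def generator(z, W_g, b_g):
    """G(z) = f(W_g z + b_g), with f assumed to be tanh."""
    return [math.tanh(v) for v in affine(W_g, z, b_g)]


def discriminator(x, W_d, b_d):
    """D(x) = f(W_d x + b_d), with f assumed to be sigmoid (probability x is real)."""
    (logit,) = affine(W_d, x, b_d)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: 2-dimensional noise, 3-feature traffic sample.
z = [1.0, -1.0]
W_g = [[0.5, 0.5],
       [1.0, 0.0],
       [0.0, 1.0]]
b_g = [0.0, 0.0, 0.0]
fake = generator(z, W_g, b_g)

W_d = [[1.0, 1.0, 1.0]]
b_d = [0.0]
score = discriminator(fake, W_d, b_d)
```

During training the discriminator would be pushed to output values near 1 for real traffic and near 0 for generated samples, while the generator would be updated to raise the discriminator's score on its samples.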
On the CICIDS 2017 dataset, the GAN model surpasses the LSTM and CNN models in precision, recall, accuracy, and F1-score. The GAN model obtains a precision of 0.976 and a recall of 0.978, which indicate low false positive and false negative rates, respectively. This suggests that the GAN model detects and classifies intrusions in network traffic data effectively.
Figure 8.
Precision score of DL approaches.
The GAN model also obtains an accuracy of 0.985, which indicates a high proportion of correct predictions overall. Its F1-score of 0.965 indicates a balance between recall and precision. These findings show that the GAN model identifies intrusions correctly and keeps misclassifications low.
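For reference, the four evaluation metrics can be computed from confusion-matrix counts as sketched below. The counts are hypothetical and are not the counts behind the reported scores; the sketch assumes a binary attack/normal setting with non-zero denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from binary confusion-matrix counts.

    Assumes tp + fp, tp + fn, and precision + recall are all non-zero.
    """
    precision = tp / (tp + fp)          # share of alerts that are real attacks
    recall = tp / (tp + fn)             # share of attacks that are detected
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts for 1000 flows.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
```

In this binary form F1 is the harmonic mean of precision and recall. If scores are instead averaged over several classes, the averaged F1 need not equal the harmonic mean of the averaged precision and recall.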
Figure 9.
Recall score of DL approaches.
With a precision of 0.964, a recall of 0.972, an accuracy of 0.978, and an F1-score of 0.962, the LSTM model also performs well. These results indicate strong intrusion detection performance, although marginally below that of the GAN model.
Figure 10.
F1-score of DL approaches.
Figure 11.
Accuracy score of DL approaches.
The CNN model performs marginally worse than the GAN and LSTM models, with reported scores of 0.945, 0.946, and 0.941 among its precision, recall, accuracy, and F1-score. These results are still respectable and show that the CNN model is effective at detecting intrusions.
Figure 12.
Evaluation score of DL approaches.
The GAN model performs best overall across the evaluation metrics, which demonstrates its superiority in detecting intrusions in network traffic data. The LSTM model also performs strongly, and the CNN model performs somewhat worse but remains a reliable intrusion detector. These findings highlight the potential of deep learning approaches for intrusion detection systems, and of the GAN model in particular for achieving high accuracy and precision in identifying network intrusions.