1. Introduction
In today’s world, the Internet has become an indispensable tool, seamlessly integrated into everyday life. People all over the world use it as a medium for communication and information exchange, and information and communication technology (ICT) is now essential in both business and daily life. In the age of big data, however, cyber-attacks on ICT systems have become increasingly sophisticated and broad, making network risk a central concern. Malicious attacks evolve continually, underscoring the need for improved network security solutions. Given the world’s growing reliance on digital technologies such as computers and the Internet, building secure and reliable programs, frameworks, and networks that can withstand these attacks is a critical task [1,2].
Intrusion detection systems (IDS) are critical for protecting computer networks: they recognize and respond to security threats by detecting irregularities in network traffic. Detection accuracy, detection time, false alarm rates, and the identification of unknown attacks remain open issues for IDS technology [3]. IDSs are classified into three types: signature-based, anomaly-based, and hybrid systems. Signature-based systems identify known attacks by matching established signatures, whereas anomaly-based systems can detect unknown hostile actions by recognizing deviations from a model of typical behavior. Anomaly-based systems, however, tend to have a high rate of false alarms [4]. Existing anomaly intrusion detection systems also have accuracy problems: some datasets lack network traffic diversity and volume, others lack diverse or recent attack patterns, and still others lack crucial feature set metadata. Hybrid IDSs, which combine anomaly-based and misuse-based detection, have proved to be a more robust and effective solution. Network intrusion detection systems (NIDS) are central to resolving these security issues: a NIDS monitors network traffic for unusual activity and analyzes the data to discover security breaches such as intrusions, misuse, and anomalies. NIDS must cope with difficulties such as high data dimensionality and large traffic volumes [5]. While many research projects have shown that machine learning approaches are useful in NIDS, these approaches have limits when confronted with large amounts of network data. Feature selection (FS) has become widely used for selecting relevant features when building strong models, and it significantly influences the efficiency and performance of IDS models [6]. As a result, three critical aspects of NIDS development are preprocessing, feature reduction, and classifier methods. Nonetheless, network intrusion detection systems still face issues such as managing massive amounts of data, high false alarm rates, and skewed data.
Machine learning (ML) techniques have been used widely in information security in recent years, and ML has found broad application in network security over the last two decades [7]. ML approaches are becoming increasingly popular for spotting anomalies [8]. ML automates the process of learning from examples and is used to build models that distinguish between normal and aberrant classes [9].
The goal of this study was to find the most effective classifier by combining preprocessing and feature selection methods with machine learning approaches that are widely used in intrusion detection systems. Popular classification algorithms are included: eXtreme Gradient Boosting (XGBoost), Classification and Regression Trees (CART), Decision Tree (DT), k-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (Bayes). Performance is evaluated against nine criteria: accuracy, precision, recall, F1 score, PCC/BA, MCC, ROC, average accuracy under k-fold cross-validation, and classification CPU time, together with model size.
The following are the main contributions of this study:
Investigation of large volumes of data associated with harmful network activity.
Identification of feature dimensions influencing classification performance in a labeled dataset with both benign and malicious traffic, resulting in improved detection accuracy.
Use of the CSE-CIC-IDS-2018 dataset for NIDS, and testing of eight different machine learning classifiers and scripts for identifying various sorts of attacks.
In general, researchers frequently work with incomplete data. In contrast, this study uses all accessible DDoS data in the experiment, reflecting real conditions by preserving the inherent data imbalance.
Presentation of a performance assessment with many facets. In particular, the evaluation considers CPU processing time, an important factor in intrusion detection, as well as the size of the experimentally obtained model, which offers potential for future extension.
The rest of the paper is structured as follows: Section 2 describes the research sequence, concept, and process. The methodology and proposed framework are described in Section 3. The experimental setup is defined in Section 4. The experiments and related discussions are presented in Section 5. Finally, Section 6 concludes the paper by discussing the model’s strengths and weaknesses and suggesting future research directions.
4. Experimental Setup
This study used a 64-bit Windows 11 operating system with the following specifications: an 11th Gen Intel(R) Core(TM) i7-11800H at 2.30 GHz and 32 GB of 2933 MHz DDR4 memory. A Python 3.11 environment was used, and the proposed model was implemented and evaluated with NumPy, pandas, and scikit-learn: pandas and NumPy for data handling, preprocessing, and analysis, and scikit-learn for model training and evaluation metrics. Seaborn and Matplotlib were used to visualize the data. The subsections that follow go into greater detail.
4.1. CSE-CIC-IDS-2018 Data Set
The CSE-CIC-IDS-2018 dataset [16] was developed through a collaborative project between the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). It was created to support intrusion detection research and has since become a benchmark dataset for evaluating IDSs. The data was captured over a ten-day period with eighty columns and fourteen attack types: FTP-BruteForce, SSH-Bruteforce, DoS attacks-GoldenEye, DoS attacks-Slowloris, DoS attacks-Hulk, DoS attacks-SlowHTTPTest, DDoS attacks-LOIC-HTTP, DDOS attack-HOIC, DDOS attack-LOIC-UDP, Brute Force-Web, Brute Force-XSS, SQL Injection, Infiltration, and Bot. This study focuses on DDoS intrusions [14], because they are a difficult type of attack to counter. The DDoS intrusion types appear on two days of the capture, namely 02-20-2018.csv and 02-21-2018.csv.
4.2. Data Preprocessing
4.2.1. Data Cleaning
We chose two days’ datasets, 02-20-2018.csv (with 84 features) and 02-21-2018.csv (with 80 features). To align the schemas, we reduced the first file to 80 features by removing four identifier columns: Flow ID, Src IP, Src Port, and Dst IP. The two days were then combined. Machine learning models built on the original 84 features were compared against models built on the reduced 80 features. The primary focus was on DDoS attacks. The labels were divided into four categories, yielding a dataset of 8,997,323 rows and 80 columns: Label 0 denotes Benign, Label 1 denotes DDoS attacks-LOIC-HTTP, Label 2 denotes DDOS attack-HOIC, and Label 3 denotes DDOS attack-LOIC-UDP.
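As a rough sketch, the merge and relabeling described above can be expressed in pandas as follows; the file names come from the dataset itself, while the exact column spellings are assumptions based on its published schema.

```python
import pandas as pd

# Load the two days of capture; 02-20-2018.csv carries four extra
# identifier columns (84 vs. 80 features).
df_20 = pd.read_csv("02-20-2018.csv", low_memory=False)
df_21 = pd.read_csv("02-21-2018.csv", low_memory=False)

# Drop the identifiers so both days share the same 80-column schema.
df_20 = df_20.drop(columns=["Flow ID", "Src IP", "Src Port", "Dst IP"],
                   errors="ignore")

# Combine the two days into one dataset.
df = pd.concat([df_20, df_21], ignore_index=True)

# Map the string labels to the four integer classes used in this study.
label_map = {
    "Benign": 0,
    "DDoS attacks-LOIC-HTTP": 1,
    "DDOS attack-HOIC": 2,
    "DDOS attack-LOIC-UDP": 3,
}
df["Label"] = df["Label"].map(label_map)
```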
4.2.2. Exploratory Data Analysis
The analysis included determining the minimum, maximum, standard deviation, and mean values for all 80 attributes, including the labels. Ten features that were zero for every instance were removed: Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, CWE Flag Count, Fwd Byts/b Avg, Fwd Pkts/b Avg, Fwd Blk Rate Avg, Bwd Byts/b Avg, Bwd Pkts/b Avg, and Bwd Blk Rate Avg. Additionally, the Timestamp feature, typed as Object, was eliminated to make the dataset more suitable for classification. The dataset at this point contained 8,997,323 rows and 69 features. To assure data quality, several procedures were applied: removal of NaN values (36,767 rows), elimination of +inf and -inf values (22,686 rows), and deletion of duplicate rows (2,302,927 rows). After these cleaning operations the dataset was refined to 6,634,943 rows, ready for further study and use.
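Continuing the sketch above, the cleaning steps could look as follows (a minimal sketch; `df` is the merged frame from the previous snippet):

```python
import numpy as np

# Drop the ten all-zero columns and the Object-typed Timestamp column.
zero_cols = [c for c in df.columns
             if c != "Label" and (df[c] == 0).all()]
df = df.drop(columns=zero_cols)
df = df.drop(columns=["Timestamp"], errors="ignore")

# Treat +inf/-inf like NaN, then drop NaN rows and duplicate rows.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna()
df = df.drop_duplicates()

print(df.shape)  # expected: (6634943, 69) after cleaning
```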
Table 1 displays data statistics before and after cleaning. Initially, there were 8,997,323 rows grouped into the four labels, with “Benign” accounting for 85.95% of the records, “DDoS attacks-LOIC-HTTP” for 6.40%, “DDOS attack-HOIC” for 7.62%, and “DDOS attack-LOIC-UDP” for 0.02%. After cleaning, the dataset was reduced to 6,634,943 rows: “Benign” entries made up 88.31% of the cleaned data, the “DDoS attacks-LOIC-HTTP” and “DDOS attack-HOIC” percentages increased slightly, and “DDOS attack-LOIC-UDP” remained at 0.03%. These changes reflect the effect of data cleansing on the distribution of the attack categories.
4.2.3. Data Normalization
Normalization is used in the data preparation step of machine learning to standardize numerical column values and ensure they are on a consistent scale [27]. As a transformation method, normalization can improve a model’s performance and accuracy considerably, especially when the distribution of the data is uncertain. This technique, which is critical in data preprocessing for network intrusion detection systems (NIDS), standardizes data to a given scale, often ranging from 0 to 1, ensuring that all features have consistent scales and ranges and thereby improving the performance and accuracy of the NIDS. Several normalization approaches are employed in data preprocessing; the most common are the following [28].
Min-Max normalization: this approach rescales the values of a feature to the range between 0 and 1 by subtracting the feature’s minimum value from each data point and dividing the result by the feature’s range. The corresponding equation is given in (1), where X is an original value and X′ is the normalized value [29]:

X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)
Z-score normalization: this method scales a feature’s values to have a mean of 0 and a standard deviation of 1 by subtracting the feature’s mean from each value and dividing by the standard deviation. The corresponding equation is given in (2), where X is an original value, X′ is the normalized value, \mu is the feature mean, and \sigma is the standard deviation [30]:

X' = \frac{X - \mu}{\sigma} \qquad (2)
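Both transforms are available in scikit-learn; a minimal sketch, assuming the cleaned frame `df` from the snippets above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = df.drop(columns=["Label"]).values
y = df["Label"].values

# Eq. (1): scale each feature to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Eq. (2): scale each feature to zero mean and unit variance.
X_zscore = StandardScaler().fit_transform(X)
```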
4.3. Feature Selection
We compared two feature selection methods in this study: Principal Component Analysis (PCA) and Random Forest (RF). The details of the comparison follow.
4.3.1. PCA
Principal Component Analysis is a statistical approach used in data analysis and machine learning to simplify complex datasets. Its major goal is to decrease the number of features or dimensions in a dataset while retaining critical information. PCA does this by transforming the original variables into a new set of variables known as principal components. These components, which are linear combinations of the original features, are constructed to be uncorrelated so that they capture the maximum variation in the data. By selecting the principal components that explain the most variability, researchers and data scientists can analyze high-dimensional data more effectively, identify patterns, and maximize the performance of machine learning algorithms. In essence, PCA simplifies both data interpretation and processing by condensing the information into a more comprehensible and insightful format [17].
4.3.2. RF
Random Forest, in addition to being a powerful prediction model, is a useful tool for feature selection in machine learning. Random Forest evaluates the value of each feature throughout the training process by determining how much it contributes to lowering impurity or error in the model. Higher importance scores are assigned to features that play a substantial role in decision-making across multiple trees. By examining these scores, data scientists can find the most influential attributes in their dataset. This built-in feature ranking capability simplifies the selection process, allowing practitioners to focus on the factors with the greatest impact on their study. Random Forest’s capacity for feature selection improves model efficiency, reduces overfitting, and improves the general interpretability of machine learning systems [18].
4.4. Classification model
Classification is the process of predicting the class of data. An IDS categorizes attacks in binary or multiclass fashion, determining whether network traffic is benign or malicious. Binary classification has two clusters, whereas multiclass datasets have n clusters. Because it requires categorizing into more than two categories, multiclass classification is considered more sophisticated than binary classification; this complexity imposes a strain on algorithms in terms of computational power and time, potentially resulting in less effective outcomes. During classification, each instance is evaluated and categorized as either typical or unusual; existing structures are maintained while new instances are assigned to classes. Classification is employed both for identifying irregular patterns and for detecting anomalies, although it is more frequently utilized for recognizing misuse. In the current study, eight machine learning techniques were applied, along with feature selection methods addressing class imbalances [31].
4.4.1. XGBoost
XGBoost is a highly effective machine learning method noted for its predictive accuracy and speed. It is classified as ensemble learning because it combines predictions from numerous decision trees to generate strong models. What distinguishes XGBoost is its emphasis on overcoming the constraints of earlier gradient boosting methods, resulting in a highly efficient algorithm. It accomplishes this by iteratively training simple models to correct errors, optimizing performance through techniques such as regularization and parallelization. Its capacity to handle complicated data relationships has made XGBoost a popular choice in a variety of industries; it has won multiple machine learning competitions and found applications in data science and finance [19].
4.4.2. CART
CART is a versatile machine learning approach capable of performing both classification and regression. It divides the dataset recursively based on feature values, producing a tree structure in which each node represents a feature and a split point. This operation is repeated until specified halting criteria are met, resulting in a binary tree. CART is well known for its simplicity and interpretability, making it a popular choice across a wide range of industries. It is notably useful for detecting non-linear relationships in data and producing accurate predictions for both categorical and numerical outcomes [20].
4.4.3. DT
A Decision Tree is a fundamental machine learning technique that can be used for classification and regression. It works by recursively splitting the dataset into subsets based on the values of the input features; splits are determined by selecting the features and criteria that produce the best class separation or the most accurate predictions. Decision Trees have a tree-like structure in which each internal node represents a feature and a split point and each leaf node represents the output, typically a class label for classification or a numerical value for regression. The technique divides the data until a stopping condition is met, such as a maximum tree depth or a minimum number of samples in a leaf node. Because they are simple to read and illustrate, Decision Trees are popular for exploratory analysis and decision-making processes [21].
4.4.4. KNN
KNN is a simple yet powerful machine learning method that may be used for classification and regression. Predictions in KNN are based on the majority class or the average of the k nearest data points in the feature space. K represents the number of nearest neighbors considered, and the method calculates distances between the query point and all other points in the dataset to discover the closest ones. In classification, the most prevalent class among these neighbors determines the prediction, whereas in regression, the average of the neighboring values defines it. KNN is non-parametric and instance-based, meaning it makes no assumptions about the underlying data distribution, which makes it adaptable and simple to grasp. However, its performance can be affected by the choice of k and the distance metric [22].
4.4.5. Multilayer Perceptron (MLP)
MLP is an artificial neural network used in machine learning. It is made up of several interconnected layers: an input layer, one or more hidden layers, and an output layer. Each node connection has a weight, and the network learns by adjusting these weights during training in order to minimize the discrepancy between expected and actual outputs. MLPs can describe complicated patterns and relationships in data, making them useful for applications such as classification, regression, and pattern recognition. Their capacity to capture nonlinear relationships makes them well suited to large and complex datasets, but they require careful tuning and a significant amount of training data to avoid overfitting [23].
4.4.6. RF
RF is a machine learning technique that generates a set of decision trees during training. Each tree in the ensemble is built with a random subset of the data and a random subset of the features. For regression tasks, the algorithm makes predictions by averaging the forecasts of the individual trees, whereas for classification tasks it takes a majority vote. Random Forest is well known for its precision, robustness, and ability to handle complex data interactions. By pooling the predictions of many trees it reduces overfitting, making it one of the most popular and powerful machine learning techniques [24].
4.4.7. LR
LR is a statistical technique used for binary classification. Contrary to its name, it is used for classification rather than regression. The algorithm calculates the likelihood that a given input belongs to a specific class: the logistic (sigmoid) function is applied to the linear combination of input features and their associated weights, converting the result into a value between 0 and 1 that signifies the likelihood of the input falling into the positive category. If this probability exceeds a threshold (typically 0.5), the input is classified as positive; otherwise, it is classified as negative. Logistic Regression is an essential tool in machine learning due to its simplicity, interpretability, and efficiency on linearly separable data [25].
4.4.8. Bayes
Naïve Bayes is a probabilistic machine learning technique used for classification. It is based on Bayes’ theorem, which assesses the likelihood of an event occurring based on prior knowledge of factors that may be relevant to it. Naïve Bayes assumes that features in the dataset are conditionally independent, meaning that the presence of one feature does not affect the presence of another. Despite this simplistic assumption (hence the term “naïve”), Naïve Bayes performs admirably in many practical applications, particularly text classification and spam filtering. It is computationally efficient, simple to implement, and performs well on huge datasets, making it a popular choice for a variety of classification jobs [26].
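For reference, one plausible way to instantiate the eight classifiers with scikit-learn and the xgboost package is sketched below. The hyperparameters are illustrative defaults, not the paper’s exact settings; in particular, scikit-learn’s DecisionTreeClassifier implements CART, so distinguishing CART and DT by split criterion here is an assumption.

```python
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Illustrative configurations for the eight classifiers compared in the study.
models = {
    "XGBoost": XGBClassifier(n_estimators=100),
    "CART": DecisionTreeClassifier(criterion="gini"),
    "DT": DecisionTreeClassifier(criterion="entropy"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300),
    "RF": RandomForestClassifier(n_estimators=100),
    "LR": LogisticRegression(max_iter=1000),
    "Bayes": GaussianNB(),
}
```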
4.5. Evaluation model
This research evaluates the intrusion detection method using nine criteria: accuracy, precision, recall, F1 score, PCC/BA, MCC, ROC, average accuracy under k-fold cross-validation, and classification CPU time, together with model size.
4.5.1. Accuracy
Accuracy is a fundamental parameter in analyzing the performance of machine learning models, notably in classification tasks. It computes the proportion of correctly predicted instances out of all instances in the dataset. High accuracy shows that the model’s predictions closely match the actual outcomes.
The F1 score is the harmonic mean of precision and recall and gives more weight to the lower of the two values. If either precision or recall is low, the F1 score will be much lower as well; if both are strong, the F1 score will be close to 1. This can produce a biased picture when one of the measurements is significantly greater than the other [4].
The Matthews correlation coefficient (MCC) is a more reliable statistical rate that produces a high score only if the prediction performed well in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to the size of the positive and negative elements in the dataset. The MCC formula takes all cells of the confusion matrix into account. In machine learning, the MCC is used to assess the quality of binary (2-class) classification; it is a correlation coefficient between the actual and predicted binary classifications and returns a value between -1 and +1. The equation is given in (4) [32], where TP (true positives) are correctly predicted positives, FN (false negatives) are wrongly predicted negatives, TN (true negatives) are actual negatives correctly predicted negative, and FP (false positives) are actual negatives wrongly predicted positive:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \qquad (4)
Receiver Operating Characteristic (ROC): most indicators can be influenced by dataset class imbalance, making it difficult to rely on a single indicator for model differentiation [33]. ROC curves are used to differentiate between attack and benign instances, with the x-axis representing the False Alarm Rate (FAR) and the y-axis representing the Detection Rate (DR).
The Probability of Correct Classification (PCC) is a probability value between 0 and 1 that examines the classifier’s ability to detect particular classes. It is critical to understand that relying only on overall accuracy across positive and negative examples can be misleading: even if the training data is balanced, performance disparities across production batches are possible. Accuracy alone is therefore not a reliable measure, which emphasizes the need for metrics such as PCC that focus on the classifier’s correct-classification probabilities for individual classes.
Balanced accuracy (BA) is calculated as the average of sensitivity and specificity, i.e., the average of the proportion correct for each class individually, after categorizing the data into two classes. The equation is given in (5). When all classes are balanced, so that each class has the same number of samples (TP + FN = TN + FP), a binary classifier’s “regular” accuracy is approximately equal to its balanced accuracy:

BA = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (5)
The ROC score handles the case of few negative labels in the same way as the case of few positive labels. It is worth noting that the F1 score stays nearly the same when positive labels are plentiful, because it only cares about misclassification of the positive label. The probabilistic interpretation of the ROC score is the probability that, if a positive example and a negative example are chosen at random, the model ranks the positive one higher, where rank is defined by the order of predicted values.
Average accuracy in k-fold cross-validation is a metric used to evaluate a machine learning model’s performance. The dataset is partitioned into k subsets, or folds. The model is trained on k-1 of these folds and validated on the remaining one; this procedure is performed k times, with each fold serving as validation data exactly once. The average accuracy is calculated by averaging the accuracy scores obtained from each fold. This ensures that the model is evaluated on multiple subsets of data, which limits the danger of overfitting and provides a more trustworthy estimate of how the model will perform on unseen data.
In the context of evaluation, CPU time refers to the overall length of time it takes a computer’s central processing unit (CPU) to complete a certain job or process. When analyzing algorithms or models, CPU time is critical for determining computational efficiency. Evaluating CPU time helps determine how quickly a given algorithm or model processes data, making it useful for optimizing performance, particularly in applications where quick processing is required, such as real-time systems or large scale data processing jobs. Lower CPU time indicates faster processing and is frequently used to determine the efficiency and practical applicability of algorithms or models.
The memory space occupied by a machine learning model when deployed for prediction tasks is referred to as model size in classification. Model size must be considered, especially in applications with limited storage capacity, such as mobile devices or edge computing environments. A lower model size is helpful since it minimizes memory requirements, allowing for faster loading times and more efficient resource utilization. However, it is critical to strike a balance between model size and forecast accuracy; highly compressed models may forfeit accuracy. As a result, analyzing model size assures that the deployed classification system is not only accurate but also suited for the given computer environment, hence increasing its practicality and usability.
Metrics such as PCC/BA and MCC are therefore better suited to cases where the data is imbalanced.
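A minimal sketch of how these criteria could be computed with scikit-learn is given below; the macro averaging, the 10-fold choice, and the use of pickle to measure model size are assumptions, not the paper’s stated procedure.

```python
import pickle
import time

from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score

def evaluate(model, X_train, X_test, y_train, y_test):
    """Compute the evaluation criteria described in Section 4.5."""
    start = time.process_time()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cpu_time = time.process_time() - start  # CPU time for training + prediction

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "f1": f1_score(y_test, y_pred, average="macro"),
        "pcc_ba": balanced_accuracy_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),
        # Multiclass ROC AUC from class probabilities, one-vs-rest.
        "roc_auc": roc_auc_score(y_test, model.predict_proba(X_test),
                                 multi_class="ovr", average="macro"),
        # Average accuracy over k folds (k = 10 assumed here).
        "cv_avg_accuracy": cross_val_score(model, X_train, y_train,
                                           cv=10).mean(),
        "cpu_time_s": cpu_time,
        # Serialized size as a proxy for deployed model size.
        "model_size_bytes": len(pickle.dumps(model)),
    }
```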
5. Experimental Results & Discussions
In Phase 1, we performed preprocessing with data cleaning, exploratory data analysis, and normalization, and double-checked for duplicates after selecting features. The dataset is divided for training, testing, and validation: the sample data is first split into 80 percent training data and 20 percent test data. See Figure 2.
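A sketch of the 80/20 split, assuming the normalized feature matrix from Section 4.2.3; stratification by label is an assumption intended to preserve the class imbalance in both partitions.

```python
from sklearn.model_selection import train_test_split

# 80% training data, 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X_minmax, y, test_size=0.2, stratify=y, random_state=42)
```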
We deleted the ten features that were zero for every instance: Bwd PSH Flags, Fwd URG Flags, Bwd URG Flags, CWE Flag Count, Fwd Byts/b Avg, Fwd Pkts/b Avg, Fwd Blk Rate Avg, Bwd Byts/b Avg, Bwd Pkts/b Avg, and Bwd Blk Rate Avg.
We removed the Timestamp column because we did not want learners to base attack predictions on time, especially when dealing with more subtle attacks.
The labels were divided into four categories, where Label 0 denotes Benign, Label 1 represents DDoS attacks-LOIC-HTTP, Label 2 represents DDOS attack-HOIC, and Label 3 represents DDOS attack-LOIC-UDP.
We removed rows containing NaN values, ±inf values, and duplicates, leaving 6,634,943 records and 69 features.
After data cleaning and exploratory data analysis, we normalized the dataset, converting the values of each feature to a specified scale, often ranging from 0 to 1. Min-Max normalization is a common method for this purpose, in which data are adjusted to fit a given range by subtracting the minimum value and dividing by the range. Z-score normalization is another strategy, standardizing features by subtracting the mean and dividing by the standard deviation, resulting in a mean of 0 and a standard deviation of 1. Normalization is especially crucial for algorithms that are sensitive to differing feature scales, since it ensures consistent and fair comparison of features during training.
In Phase 2, we split the process into two parts. First, we reduced the number of features using PCA and RF techniques and fed the processed data into the classification models. Second, we used all the data without feature reduction and applied the same classification models, evaluating the outcomes of classification including CPU runtime and model size.
We used PCA to reduce the number of features according to explained-variance-ratio thresholds, producing several feature sets: 11 features (PCA11) for variance ratios greater than or equal to 0.006586494, 9 features (PCA9) for 0.017037139, 7 features (PCA7) for 0.036543147, 5 features (PCA5) for 0.052597381, and 3 features (PCA3) for 0.125926325. Figure 3 depicts the importance of these variance ratios. Once these critical attributes were found, they were employed in Phase 3 for data classification and further evaluation.
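As a sketch, this thresholding can be reproduced with scikit-learn’s PCA; the 0.006586494 cut-off shown is the PCA11 threshold quoted above.

```python
from sklearn.decomposition import PCA

# Fit PCA once to obtain the explained-variance ratios (sorted descending).
pca = PCA().fit(X_train)

# Keep every component whose ratio meets the PCA11 threshold.
threshold = 0.006586494
n_keep = int((pca.explained_variance_ratio_ >= threshold).sum())

pca_k = PCA(n_components=n_keep).fit(X_train)
X_train_pca = pca_k.transform(X_train)
X_test_pca = pca_k.transform(X_test)
```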
We used Random Forest (RF) to narrow down the feature set based on feature importance scores. The following criteria were used to choose features: 22 features (RF22) for importance scores greater than or equal to 0.02, 13 features (RF13) for 0.03, and 4 features (RF4) for 0.05, as shown in Figure 4. Following the identification of these essential features, they were employed in Phase 3 for data classification and further evaluation.
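The RF-based selection can be sketched similarly; the 0.02 cut-off shown is the RF22 threshold, and the estimator settings are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Keep features whose importance score meets the chosen threshold.
mask = rf.feature_importances_ >= 0.02  # RF22; use 0.03 or 0.05 for RF13/RF4
X_train_rf = X_train[:, mask]
X_test_rf = X_test[:, mask]
```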
Phase 3 is the final stage, in which the prepared dataset is analyzed further. The preprocessing and feature selection methods feed into machine learning approaches that are extensively used by researchers in intrusion detection systems. Popular classification algorithms such as XGBoost, CART, DT, KNN, MLP, RF, LR, and Bayes are included. The evaluation of performance encompasses several dimensions, and the results are summarized in Table 2.
Table 2 contains two sections, for data normalized with Min-Max and with Z-score. The Min-Max normalization section presents the performance metrics of the various classifiers. XGBoost leads in all categories, including accuracy (0.999950), precision (0.975427), recall (0.982578), and F1 score (0.978946); it also has a high MCC and area under the ROC curve, showing strong performance overall. The DT and CART classifiers approach XGBoost in accuracy and balanced metrics while having smaller model sizes and cheaper computing costs. RF has a high recall (0.982593) but a much greater model size and computational load. The recall of Bayes is impressive (0.984988), but it comes at the sacrifice of precision and overall accuracy. LR achieves a good balance of precision and recall, whereas MLP and KNN specialize in high precision and high recall, respectively. The classifier should be chosen based on specific needs such as accuracy, computational efficiency, or the trade-off between precision and recall, while also taking into account model size and processing time.

The Z-score section reports the corresponding metrics under Z-score scaling. XGBoost again delivers high accuracy (0.999948) along with high precision, recall, F1 score, and MCC. The DT and CART classifiers come close to XGBoost on a variety of metrics while being more computationally efficient and requiring smaller model sizes. RF has a high recall (0.982353) but a much greater model size and higher computational cost. Bayes excels in recall at the expense of precision and overall accuracy. LR achieves a good mix of precision and recall, whereas MLP has high recall and KNN high precision. Specific needs, such as accuracy, computational efficiency, or trade-offs between precision and recall, should be considered when selecting a classifier, as should model size and processing time.
Because of the multiple evaluation criteria available, we focused on the ROC values together with CPU time and model size. On these factors, three classifiers, DT, XGBoost, and RF, produced very comparable evaluation results and were selected for the feature selection trials.
Following that, these models were combined with the PCA and RF feature selection approaches. Table 3 displays the results of these tests.
Table 3 compares the efficacy of the DT, XGBoost, and RF classifiers. The comparison covers both the Min-Max and Z-score normalization methods, as well as the deployment of the feature selection approaches. The major goal is to shorten CPU runtime and reduce model size. The results can be explained as follows.
Consider first the Min-Max normalization results with PCA and RF feature selection. The data provides a full comparison of the various classifier settings and their performance indicators. Higher PCA dimensions generally lead to greater accuracy, precision, and recall when assessing RF models across the PCA feature sets. Notably, RF-PCA11 and RF-PCA9 reach accuracy levels above 0.996145, illustrating the efficiency of feature selection in improving model performance. DT models combined with PCA also provide competitive accuracy, particularly at higher PCA dimensions. The RF and XGBoost models coupled with PCA exhibit strong precision and recall, making them solid options for applications requiring balanced performance. When determining the best configuration for a given task, it is critical to consider the trade-offs between accuracy, computational complexity (as measured by CPU time), and model size. This comparison emphasizes the significance of carefully selecting feature selection strategies and classifier combinations to produce the best results for individual needs. To aid understanding of the model performance metrics, the data is also shown as a radar graph in Figure 5.
The corresponding results under Z-score scaling show a similar pattern. When examining RF models in conjunction with PCA at various dimensions, greater PCA dimensions typically result in improved accuracy, precision, and recall. Specifically, RF-PCA11 and RF-PCA9 exhibit outstanding accuracy above 0.997387, demonstrating PCA’s usefulness in optimizing model outputs. DT models paired with PCA also perform well, especially at larger PCA dimensions. Furthermore, combining RF and XGBoost with PCA yields good precision and recall, making them solid candidates for applications requiring balanced performance. However, when choosing the optimal model configuration, it is critical to analyze the trade-offs between accuracy and computational complexity, as indicated by CPU time and model size. This analysis emphasizes the importance of choosing appropriate PCA dimensions and classifier combinations to produce optimal outcomes for the job requirements. To aid understanding of the model performance metrics, the data is also shown as a radar graph in Figure 6.
Across these trials, feature selection with PCA using 11 features combined with XGBoost produced the best performance, considering ROC values together with CPU runtime. This held whether the data was standardized with the Min-Max or the Z-score approach, because the evaluation results and CPU processing times were extremely similar (insignificant differences); both methodologies can therefore be used effectively.
Table 4 lists the selected PCA features, a total of 11 variables.
Table 5 compares various models on the CSE-CIC-IDS-2018 dataset, measuring their accuracy, training time, and other performance measures. S. Ullah et al. [14] employed a Decision Tree (DT) with random feature selection (30 features) to achieve 0.9998 accuracy in a very low training time (0.18 seconds). Khan [15] used random feature selection with an HCRNNIDS model, obtaining 0.9775 accuracy in 200-250 seconds; F1 score and precision values were not reported. Kim et al. [10] used a Convolutional Neural Network (CNN) with manual feature extraction, achieving an accuracy of 0.960 with training durations ranging from 300 to 900 seconds; F1 score, precision, and recall were not provided in detail. R. Qusyairi et al. [3] applied an ensemble model with 23 randomly chosen features; although the accuracy was 0.988, no F1 score, precision, or recall statistics were provided. S. Chimphlee et al. [4] used Min-Max normalization, Random Forest feature selection, and class balancing (SMOTE) with a Multi-Layer Perceptron (MLP), achieving a high accuracy of 0.99462 with significant precision and recall values.
Our Model 1 and Model 2 both used XGBoost with Principal Component Analysis (PCA) for feature selection. Model 1 reached 0.997706 accuracy and Model 2 reached 0.997698, and both performed well across multiple measures, including F1 score, precision, recall, ROC, and MCC. In conclusion, the compared models demonstrate a variety of approaches, with PCA proving particularly helpful in lowering feature dimensionality while maintaining high accuracy. The models yield remarkable results in intrusion detection, reflecting the ongoing progress in applying machine learning to cybersecurity.