2.1. Traffic Payload-Based Detection
Payload-based detection analyzes packet contents to identify signatures or anomalies indicative of botnets. Early systems used deep packet inspection (DPI) to match known botnet signatures defined manually or via regular expressions [7]. However, hand-engineering detection rules is labor-intensive and fails to catch zero-day attacks. This motivated the use of machine learning to learn models that separate benign and malicious payloads automatically.
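To make the signature-matching idea concrete, the following is a minimal sketch of regex-based payload inspection. The signature names and patterns are hypothetical and chosen only for illustration; they are not rules taken from any of the cited systems.

```python
import re

# Hypothetical signatures for illustration only; real DPI rule sets are
# far larger and are maintained by analysts.
SIGNATURES = {
    "irc_bot_join": re.compile(rb"^JOIN #\w+", re.MULTILINE),
    "http_bot_beacon": re.compile(rb"GET /gate\.php\?id=[0-9a-f]{8}"),
}


def match_signatures(payload: bytes) -> list:
    """Return the names of all signatures found in a raw packet payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]


if __name__ == "__main__":
    print(match_signatures(b"GET /gate.php?id=deadbeef HTTP/1.1\r\n"))
    print(match_signatures(b"GET /index.html HTTP/1.1\r\n"))
```

A payload that does not match any stored pattern passes unnoticed, which is the zero-day weakness noted above.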
Specific machine learning models for payload-based detection include multilayer perceptrons for general pattern recognition, convolutional neural networks for identifying spatial patterns in packet data, and recurrent neural networks for sequence modelling of network streams. For example, Mashaleh et al. [8] developed an IoT botnet detection system using convolutional and recurrent neural networks trained on fused packet payload and flow data, achieving 97.3% accuracy. Alsarhan et al. [4] proposed an intrusion detection framework for vehicular networks using SVMs optimized with genetic algorithms (GAs) and particle swarm optimization (PSO). Features were extracted from network traffic payloads. The results showed that GA-optimized SVMs achieved higher accuracy than both PSO-optimized and unoptimized SVMs.
Younisse et al. [9] addressed the lack of labeled multistage attack data by developing a novel kerberoasting attack dataset. Network traffic capturing a realistic attack scenario was processed to extract informative features for each attack stage. The resulting dataset provides valuable data reflecting the sequential steps of an actual attack life-cycle. This can enhance the testing and evaluation of intrusion detection systems against sophisticated multistage attacks. However, only a single attack type was included, and expanding to diverse botnet scenarios could strengthen its utility.
Alslman et al. [10] focused on improving the robustness of payload-based intrusion detection systems against adversarial attacks. They utilized a denoising autoencoder technique to defend SNMP-MIB based botnet detection from multiple white-box and black-box attack types. Testing showed accuracy improved from 68% under attack to 90% with the defense applied. However, their approach concentrated only on protection rather than initial botnet detection capabilities. The study demonstrates the potential of autoencoders to harden payload-based detectors against evasion attempts.
AlMasri et al. [11] addressed the lack of data for supply chain attacks by generating a novel dataset modeling the SUNBURST malware. The concise, practical dataset reflecting observable attack indicators will aid researchers in enhancing defences against this threat vector. However, only one machine-learning algorithm was used to validate the data. The specialized dataset is valuable for evaluating payload analysis methods against supply chain attacks.
Qabalin et al. [12] collected a new payload-based dataset focused on Android spyware detection. Traffic reflecting spyware installation and operation activities was gathered on real devices. Multi-class experiments demonstrated 79% accuracy in identifying the spyware strains. However, only random forest was tested as the classifier. The unique dataset advances malware detection research, although more complex models could be evaluated.
A limitation of payload-based detection is the inability to analyze encrypted traffic contents. Botnets increasingly leverage encryption and polymorphism to evade deep packet inspection. Alternative methods to inspect encrypted traffic metadata have been proposed. However, fundamentally, encryption reduces the efficacy of payload-based botnet detection.
Table 1 summarizes key papers in this area, their methods, datasets, results, pros, and cons.
The highlighted works showcase deep learning, optimization, dataset generation, and robustness techniques for advancing payload-based botnet detection. The machine learning innovations explored include CNN-RNN fusion, autoencoders, and ensemble learning. However, several limitations remain: evaluation on single datasets, a focus on defence rather than initial detection, and testing of only a few classifier types.
In conclusion, payload analysis continues to be a valuable approach for identifying botnet threats through network traffic monitoring. The surveyed papers demonstrate promising applications of machine learning to enhance detection accuracy, integrity against attacks, and generalizability across evolving botnets. Further research can build on these works by expanding evaluation across diverse, standardized datasets using a broader range of deep learning architectures in an adversarial setting. Advancing payload-based detection will require leveraging the full suite of modern machine-learning capabilities.
2.2. Flow-Based Detection
Network flows encapsulate key traffic statistics like duration, bytes, and timing between packets. Flow-based detection analyzes communication patterns to identify botnet anomalies and cluster similar flows.
Early work focused on hand-selecting flow features to improve botnet classification. Alauthman et al. [1] used a reinforcement learning classifier trained on flow features, including flow duration, average bytes, and average packets per flow. More recent research automates feature engineering using machine learning. Alieyan et al. [2] evaluated various classifiers, including SVMs, Naive Bayes, and decision trees trained on statistical flow features for botnet detection.
A benefit of flow-based detection is the ability to analyze encrypted traffic by relying exclusively on flow metadata. However, flow-based methods remain susceptible to mimicry attacks, which manipulate flows to emulate normal traffic patterns [5]. Defending against flow-based evasion attacks remains an open challenge.
Useful flow features engineered from network traffic include duration, idle time between packets, packet counts, byte counts, byte-per-packet ratios, and variance in packet arrival intervals. Preparing raw flow data for effective machine learning requires careful preprocessing like normalization.
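The feature list above can be sketched in a few lines of code. The example below is a minimal illustration, assuming each flow is given as a list of packet timestamps (in seconds) and packet sizes (in bytes); the function names and the choice of min-max scaling are assumptions made for this sketch, not the preprocessing used in the surveyed papers.

```python
from statistics import pvariance


def flow_features(timestamps, sizes):
    """Compute simple per-flow statistics.

    timestamps: packet arrival times in seconds, in ascending order
    sizes: packet sizes in bytes, in the same order
    """
    # Gaps between consecutive packets (inter-arrival times).
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    total_bytes = sum(sizes)
    return {
        "duration": timestamps[-1] - timestamps[0],
        "packet_count": len(sizes),
        "byte_count": total_bytes,
        "bytes_per_packet": total_bytes / len(sizes),
        "max_idle": max(gaps) if gaps else 0.0,
        "iat_variance": pvariance(gaps) if gaps else 0.0,
    }


def min_max_normalize(rows):
    """Scale every feature to [0, 1] across a list of feature dicts."""
    keys = list(rows[0].keys())
    low = {k: min(r[k] for r in rows) for k in keys}
    high = {k: max(r[k] for r in rows) for k in keys}
    return [
        {
            # Constant features carry no information and are mapped to 0.
            k: (r[k] - low[k]) / (high[k] - low[k]) if high[k] > low[k] else 0.0
            for k in keys
        }
        for r in rows
    ]


if __name__ == "__main__":
    flows = [
        flow_features([0.0, 1.0, 3.0], [100, 200, 300]),
        flow_features([0.0, 0.5], [50, 50]),
    ]
    for row in min_max_normalize(flows):
        print(row)
```

Normalization matters here because byte counts and durations live on very different scales, and distance-based or gradient-based learners are sensitive to that difference.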
Table 2 summarizes key flow-based botnet detection research in terms of methods, datasets, results, advantages, and limitations.
The highlighted works showcase reinforcement learning and classical ML algorithms for learning from network flow metadata. Automating feature engineering eliminates manual selection needed in early systems. However, limitations exist around model overfitting and susceptibility to mimicry attacks.
In conclusion, flow-based detection provides a means to monitor botnets, including encrypted traffic, by relying on metadata patterns. The surveyed papers demonstrate initial applications of machine learning to extract useful flow features and comparisons between algorithms. Further research is needed to enhance model generalization and integrity against evasion attempts. Applying deep learning and adversarial training represents promising directions for progress in flow-based botnet detection.
2.3. DNS-Based Detection
The Domain Name System (DNS) offers a key control channel for botnets to locate C&C servers and receive updates. Analyzing DNS traffic can reveal botnet activity through anomalous lookup patterns. Botmasters use domain generation algorithms (DGAs) to create pseudo-random domain names for contacting the C&C server. Fast-flux manipulates DNS bindings to rapidly change the IPs mapped to C&C domains. These behaviours deviate from legitimate DNS traffic [13].
Machine learning has been applied to detect botnet DNS patterns using domain entropy, query counts, IP diversity, and time-series lookups.
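Domain entropy is the simplest of these features to illustrate. The sketch below computes the Shannon entropy of a domain's leftmost label; using only the leftmost label is a simplifying assumption for this example, and the sample domains are made up.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def domain_entropy(domain: str) -> float:
    """Entropy of the leftmost label, e.g. 'example' in 'example.com'."""
    return shannon_entropy(domain.lower().split(".")[0])


if __name__ == "__main__":
    for name in ["google.com", "xq7kz2pd9w.com"]:
        print(name, round(domain_entropy(name), 3))
```

Algorithmically generated names tend to score higher than human-chosen ones, but entropy alone is a weak signal and is normally combined with the other features listed above.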
Alkasassbeh & Almseidin [14] developed a DNS-based botnet detection system comparing multiple machine learning classifiers on DNS tunnelling data. Their results showed random forest achieving the highest accuracy for detecting DNS tunnelling compared to algorithms like J48 and multilayer perceptrons. Evaluating multiple classifiers demonstrated machine learning’s capability to identify malicious DNS patterns. However, the scope was limited to DNS tunnelling data rather than diverse botnet traffic. Overall, the work highlights the promise of random forest models for DNS-based botnet detection. Extending the approach to other botnet detection tasks and datasets could further establish its capabilities.
Almomani et al. [15] proposed an ensemble-based system using max voting for classifying darknet traffic. Combining random forest, KNN, and gradient boosting classifiers, their approach attained 98.76% accuracy on the CIC-Dark2020 dataset. The ensemble method integrates the outputs from diverse models to improve performance over single classifiers. However, the technique focused exclusively on darknet traffic classification rather than general botnet detection applications. The study indicates that ensemble learning is a valuable strategy for boosting DNS-based botnet detection accuracy. Similar voting ensemble approaches on wider network traffic data could enhance real-world botnet detectors.
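Max voting itself is straightforward: each base model predicts a label, and the most frequent label wins. The following is a minimal sketch of that combination step only, assuming the per-model predictions are already available; the labels and model outputs shown are invented for illustration and do not reproduce the cited experiments.

```python
from collections import Counter


def max_vote(votes):
    """Return the label that received the most votes for one sample."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_predictions):
    """Combine predictions from several models by max voting.

    per_model_predictions: one list of labels per model, all of equal length.
    """
    return [max_vote(votes) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Hypothetical outputs of three base classifiers on three samples.
    random_forest = ["tor", "benign", "vpn"]
    knn = ["tor", "vpn", "vpn"]
    gradient_boosting = ["benign", "benign", "vpn"]
    print(ensemble_predict([random_forest, knn, gradient_boosting]))
```

With an odd number of models and two classes a tie cannot occur; in multi-class settings a tie-breaking rule has to be chosen explicitly.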
A limitation of DNS detection is the ability of advanced botnets to mimic legitimate traffic via domain caching and other evasion tactics. DNS analysis also misses alternative C&C communication methods.
Table 3 summarizes key DNS-based botnet detection research.
The highlighted works demonstrate the effectiveness of ensemble learning and comparative classifier evaluations for improving DNS-based detection. However, limitations exist around darknet specificity and lack of diverse botnet data.
In conclusion, DNS traffic analysis provides a valuable detection approach by identifying anomalous lookup patterns. The surveyed papers showcase applications of machine learning, especially ensemble methods, to extract useful DNS features. Further research should focus on expanding evaluation to general botnet datasets and real-world traffic. Applying deep learning for representational DNS data modelling also offers promise. Advancing DNS-based detection will require leveraging the full spectrum of modern machine-learning techniques.
2.4. Hybrid Detection
Hybrid detection combines multiple data sources like payloads, flows, and DNS to improve accuracy. The enriched features provide additional perspectives to identify heterogeneous botnet behaviors.
Alieyan et al. [16] evaluated a hybrid model merging DNS and flow features using various classifiers, including SVMs, Naive Bayes, and decision trees. Mashaleh et al. [8] proposed an early IoT botnet detection framework fusing packet and flow features classified with SVMs, attaining 97.3% accuracy.
A core challenge is the large feature space resulting from merged data sources, risking model overfitting. Careful feature selection is necessary to determine optimal hybrid feature subsets. Complex neural architectures are also required to fuse the diverse data types effectively.
Hybrid systems fuse diverse data sources such as payloads, flows, and DNS to enrich perspectives for detecting heterogeneous botnets.
Table 4 summarizes key research on hybrid botnet detection approaches.
The highlighted works demonstrate classifier comparisons and deep learning models for hybrid detection. Fusing DNS, flow, and payload data improves visibility into diverse botnets. However, limitations exist around single model evaluation and lack of focus on evasion robustness.
In conclusion, hybrid detection leverages complementary data sources to enhance visibility into heterogeneous botnets. The surveyed papers present initial applications of machine learning, including deep neural networks, for fusing and analyzing multi-modal data. Further research should concentrate on robustness against attacks, scalable streaming methods, and emerging deep learning architectures. Advancing hybrid detection requires fully utilizing the latest machine learning innovations.
2.5. Anomaly-Based Detection
Almseidin & Alkasassbeh [17] developed an anomaly-based IoT botnet detection system using fuzzy rule interpolation. Their approach avoids binary decisions and provides interpretable outputs. Testing on an IoT botnet dataset yielded a 96.4% detection rate. However, fuzzy rule-based methods face challenges with incomplete rule bases. Overall, interpolative fuzzy logic shows promise for enhancing anomaly detection and explainability.
Almseidin et al. [18] also applied fuzzy rule interpolation for detecting phishing website attacks. Their method can handle incomplete rule bases and smooth boundaries between normal and attack traffic. Evaluation on a phishing website dataset achieved 97.58% detection accuracy. While promising, the approach was tailored specifically for phishing attacks rather than general botnet detection. The interpolative reasoning enhanced robustness to rule base gaps.
Alkhamaiseh et al. [19] proposed a multistage one-class SVM model for anomaly-based detection of unknown attacks. Using the SNMP-MIB dataset, their approach combines wrapper and filter feature selection to train the SVM classifiers. In testing, 97% of unknown attacks were successfully detected. However, the multistage SVM model is relatively complex. The study demonstrates the potential of one-class methods for identifying novel attacks.
Almseidin et al. [20] developed an anomaly-based distributed denial-of-service (DDoS) attack detection system using fuzzy inference. Their approach aims to avoid binary decisions and provide human-interpretable outputs. Testing on a DDoS dataset yielded strong results of 96.25% accuracy and a 0.006% false positive rate. Fuzzy logic enhances the IDS alert system by delivering more nuanced attack assessments instead of binary decisions. However, the evaluation was restricted to a single DDoS dataset, and generalizability to other botnet types is unclear. Overall, the paper demonstrates the potential of fuzzy inference systems to improve the robustness, explainability, and precision of anomaly-based botnet detection. The interpolative reasoning capability enables alerts to be generated even with incomplete rule bases. Further research can build on this explainable AI approach for botnet detection across diverse datasets and attack types.
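The contrast between a binary decision and a graded fuzzy alert can be shown with a small example. The sketch below uses triangular membership functions and a weighted average of rule outputs; the fuzzy sets, alert levels, and the single normalized input are all hypothetical, and the code does not implement the rule interpolation used in the cited works.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over a packet rate normalized to [0, 1].
FUZZY_SETS = {
    "normal": (-0.5, 0.0, 0.5),
    "suspicious": (0.0, 0.5, 1.0),
    "attack": (0.5, 1.0, 1.5),
}

# Hypothetical alert level attached to each set (one rule per set).
ALERT_LEVEL = {"normal": 0.0, "suspicious": 0.5, "attack": 1.0}


def alert_score(rate):
    """Return membership degrees and a graded alert score in [0, 1]."""
    rate = min(max(rate, 0.0), 1.0)  # clamp so at least one set always fires
    degrees = {name: triangular(rate, *abc) for name, abc in FUZZY_SETS.items()}
    total = sum(degrees.values())
    score = sum(degrees[name] * ALERT_LEVEL[name] for name in degrees) / total
    return degrees, score


if __name__ == "__main__":
    for rate in (0.1, 0.5, 0.75, 1.0):
        print(rate, alert_score(rate))
```

A rate of 0.75 belongs partly to "suspicious" and partly to "attack", so the analyst receives an intermediate score rather than a hard yes or no.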
Table 5 summarizes key research in this area.
The highlighted works demonstrate the benefits of explainable models like fuzzy systems and one-class SVMs for detecting anomalies and unknown threats. Key limitations include specificity to attack types and model complexity.
In conclusion, anomaly detection provides an important paradigm for identifying zero-day botnet attacks. The surveyed papers showcase promising applications of interpretable machine learning models to enhance the detection of novel threats. Further research should improve model generalization, integration, and evaluation on diverse, real-world data. Overall, explainable anomaly detection exhibits significant potential for advancing botnet detection systems.