4.1. Introduction
The findings of this analysis substantially enhance the understanding of the evolving global cyber threat landscape, shedding light on RQ1 concerning key trends and patterns in cyber-attacks over the analyzed period. These findings also substantiate H1, demonstrating the growing sophistication and target specificity of cyber-attacks over time.
Although many entries from specific countries or IPs do not necessarily denote malicious intent, the data still provides insight into zones of high cyber activity. This information could assist in creating geographically tailored cybersecurity strategies, addressing the growing sophistication and target specificity identified in H1. However, the caveat that cybercriminals often disguise their actual location must be noted, which may result in the geographical data not accurately reflecting the attacker's original location.
Notably, the surge in daily attacks, especially those revealing increasingly sophisticated methods, suggests periods of intense, organized activity, possibly linked to specific events or campaigns. This underscores the patterns and evolving tactics outlined in RQ1 and H1. Moreover, the analysis exposes the exploitation of a broad range of increasingly complex tactics by perpetrators, targeting various services, including those used for remote access and network diagnostics. The progression in the diversity and complexity of these tactics aligns with the trends identified in the research question and hypothesis, providing clear evidence of the ever-evolving threat landscape.
The highlighted trends and patterns potentially inform cybersecurity strategies, policymaking, and resource distribution, suggesting the necessity of increasingly multifaceted and anticipatory measures to counteract the evolving sophistication and target specificity of cyber-attacks.
4.2. Interpretation of Results
The results derived from the data analysis reveal the intricacy and diversity of network traffic behavior, addressing RQ1 by illustrating the key trends and patterns in cyber-attacks over the analyzed period. The results support H1, indicating that cyber-attacks have become increasingly sophisticated and targeted.
The grouping of network traffic data into distinct clusters demonstrates the variability in network behavior patterns, which must be understood when crafting robust security measures against increasingly sophisticated threats. Each cluster signifies unique network characteristics, requiring specialized preventative and responsive measures to effectively safeguard network security in the face of growing attack specificity.
The time-series analysis, capturing the temporal patterns in attack counts, pinpoints periods of unusual activity or anomalies, which could be attributed to the increasing sophistication of cyber-attacks, as suggested in H1. Identifying periods with spiked attack counts, particularly in July 2017, December 2018, and October 2019, reinforces the need for a temporal approach to network security, especially as techniques evolve, as indicated by RQ1.
The behavior score, ranging from 0 to 136, is a quantifiable measure for potential anomalies and serves as a tool to gauge the increasing sophistication and target specificity of cyber-attacks. The validation of these scores underscored their effectiveness as reliable indicators of abnormal behavior.
Geographical and autonomous system data play a crucial role in understanding the sources of network anomalies. The higher frequency of abnormalities stemming from the United States and Germany and specific autonomous systems such as DIGITALOCEAN-ASN, F3 Netze e.V., and Zwiebelfreunde e.V. suggests these areas and systems warrant close attention given the evolving nature and increasing specificity of cyber-attacks outlined in RQ1 and H1.
While these findings cast light on the nature and sources of network anomalies, it's noteworthy that the behavior score indicates the potential for abnormalities but not the specific type or severity of the anomaly. This aligns with H1's assertion of increased sophistication, as newer attacks may deviate from known patterns. Future research should therefore explore ways to enhance the current methods with mechanisms to discern the specific nature and potential impact of detected anomalies, particularly as threats become more complex and targeted.
In conclusion, the interpretation of these results underscores the multifaceted nature of network traffic and the imperative for a comprehensive approach to ensuring network security. As RQ1 and H1 indicate, the dynamic nature of cyber threats calls for a multi-pronged approach to countering them, integrating temporal, geographical, and autonomous system data along with a quantitative measure of behavior. These aspects should all be considered to effectively identify and address network anomalies in an ever-evolving threat landscape.
4.3. Data Collection and Preprocessing
The data collection and preprocessing procedures undertaken in this study enhanced the reliability and validity of the findings. Before the analysis, the data underwent cleaning, normalization, and transformation to ensure consistency. Feature extraction techniques were then applied to isolate relevant information, preparing the dataset for subsequent analysis and contributing to the robustness of the study's outcomes.
The findings from this dataset reveal a complex and multifaceted cyber threat landscape. The striking disparities in the origin of attacks highlight the global nature of the cyber threat, pointing toward the need for enhanced international cooperation and coordination in addressing cyber threats. However, it is also important to note that these disparities may be influenced by a range of factors, including the digital infrastructure, policies, and practices in different regions, as well as the ability of attackers to disguise their actual location.
These insights underscore the need for continuous monitoring and analysis of cyber activities and for developing effective and adaptive strategies to mitigate cyber threats. This study demonstrates the advantage of such comprehensive data collection and preprocessing efforts in generating critical insights that can inform policy and practice in cybersecurity.
4.3.1. Descriptive Analysis
The observed daily frequency of approximately 45,741 entries and the peak of 888,203 attacks in a single day reveal the scale and intensity of cyber threats. The sporadic non-attack days, such as November 16, 2016, could suggest periods of relative calm or possibly a shift in attack strategies. These patterns underscore the dynamic nature of the cyber threat landscape, requiring constant vigilance and adaptive responses.
The analysis underscores the erratic and volatile nature of cyber-attacks, with daily counts varying widely over the six years. The high degree of variation and the skewed distribution highlight the challenge of predicting and preparing for cyber threats. Days with no recorded attacks are rare (17 out of 2,191 days), reinforcing the constant nature of the cyber threat landscape.
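The daily figures discussed above (typical daily volume, single-day peak, and number of zero-attack days) can be derived from a list of log-entry dates. The following is a minimal sketch under stated assumptions: the function name `daily_summary` and the toy dates are hypothetical and are not taken from the study's dataset or code.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, median


def daily_summary(entry_dates, start, end):
    """Summarize per-day entry counts over the inclusive range [start, end]."""
    counts = Counter(entry_dates)
    n_days = (end - start).days + 1
    # Build the full calendar so that days without any entry count as zero.
    series = [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]
    return {
        "days": n_days,
        "mean": mean(series),
        "median": median(series),
        "max": max(series),
        "zero_days": sum(1 for c in series if c == 0),
    }


# Hypothetical toy log: one date per recorded entry.
toy_dates = (
    [date(2020, 1, 1)] * 3
    + [date(2020, 1, 2)] * 1
    + [date(2020, 1, 4)] * 8
)
summary = daily_summary(toy_dates, date(2020, 1, 1), date(2020, 1, 4))
print(summary)
```

A mean that exceeds the median, as in the toy data, is a simple indication of the right-skewed distribution described in the text.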
The marked distribution disparity points towards the global nature of cyber threats, highlighting the necessity for international cooperation to mitigate these threats effectively. However, it is essential to remember that these distribution disparities might not fully represent the actual origin of the attacks, as cybercriminals often obscure their real locations.
The descriptive analysis of the honeypot log presents a quantitative understanding of the cyber threat landscape. The observed distribution disparities, peak activity, and periods of calm comprehensively depict cyber activities. This study lays the groundwork for further analysis and interpretation of cyber threats, emphasizing the importance of data-driven strategies to strengthen cybersecurity. The findings underscore the dynamic and complex nature of the cyber threat landscape, reiterating the need for robust and adaptive cybersecurity measures informed by meticulous data analysis.
4.3.2. Temporal Analysis
The temporal analysis yielded a critical understanding of the cyclical trends in cyber-attacks. The marked peaks in July 2017 and October 2019, followed by an overall increase in attack volumes from late 2019 onwards, point to an evolving and escalating cyber threat landscape. These patterns suggest that cyber threats are becoming more sophisticated and targeted, aligning with the initial hypothesis (H1) that cyber-attacks show a marked increase in sophistication and target specificity over time.
However, it is essential to consider the possibility of attack automation and an overall increase in Internet activity contributing to these high volumes. The variations in attack volumes could also indicate changing attacker tactics, advancements in detection methods, or the influence of global events. Consequently, these temporal trends necessitate ongoing evaluation to adapt and update cybersecurity measures in response to the evolving threat landscape.
The findings emphasize the importance of continual monitoring, evolution, and adaptation of cybersecurity strategies to detect and mitigate threats effectively. The study substantiates the growing significance of data-driven analytical approaches to understanding and addressing the complexities of cyber threats in the evolving digital era.
4.3.3. Correlation Analysis
The moderate to high correlations observed between the source AS numbers, corporate names, and numerous other indicators of malicious Internet activity suggest potential associations within the parameters studied. Such meaningful relationships may assist in predicting and identifying malicious activity based on known patterns. However, it must be emphasized that correlation does not imply causation, thereby necessitating further examination to ascertain causal relationships between these variables.
In interpreting these correlations, one could hypothesize that attackers may utilize specific AS numbers, as indicated by the high correlations. However, additional factors such as the nature of the organization and its Internet traffic, the network infrastructure, and other contextual factors could influence these correlations. Therefore, considering these variables in future investigations would be crucial to validate and comprehend the observed correlations better.
The study underscores the necessity for a cautious interpretation of these correlations and the importance of further research to establish causal links. These findings highlight the potential of data-driven, statistical approaches to augment understanding and predict cyber threats, contributing to more efficient and proactive cybersecurity strategies.
4.3.4. Geographic Analysis
The geographic distribution of cyber-attacks offers crucial insights into the patterns of malicious cyber activity. The significant fraction of cyber activities originating from the United States, Russia, and China could indicate several factors, including technological advancement, economic influence, and geopolitical relevance. However, it's worth considering that cybercriminals frequently mask their precise location, which could skew the geographic data. Furthermore, the high concentration of cyber activity within the top 20 countries might reflect their technological infrastructure and international standing. Such insights could be instrumental in shaping geographically precise cybersecurity policies and strategies. However, future studies should address the potential discrepancies resulting from attackers' masking of specific locations.
These findings emphasize the global nature of cyber threats and highlight the importance of international cooperation and strategy development in cybersecurity. However, it is crucial to note the potential for location obfuscation by attackers, indicating the need for additional corroborative strategies to accurately trace the origins of cyber threats. The complex and international nature of these threats necessitates a multifaceted and global response.
4.3.5. Threat Analysis
The threat analysis presented in the study underscores the complexity and diversity of the cyber threat landscape. A substantial number of unidentified threats (zero-count entries) emphasize the continual evolution of cyber threats and the limitations of current threat intelligence repositories in capturing the complete range of malicious activity. The prominence of specific categories in the non-zero count entries signifies the prevalence of particular types of malicious activities or sources, providing valuable insights for devising targeted defense strategies. However, it's also critical to note the importance of minor categories, which, although constituting a smaller portion of the dataset, may represent emerging or less common threat vectors that warrant further exploration.
The significant number of unidentifiable threats reiterates the need to continuously enhance threat intelligence repositories and adopt adaptive, multifaceted cyber defense strategies. The study's findings highlight the importance of ongoing research to understand the rapidly changing nature of cyber threats and develop effective strategies to counter them.
4.3.6. Source IP Address Analysis
The study highlights the importance of scrutinizing the source IP address variable in understanding the origins and patterns of cyber-attacks. The findings suggest concentrated sources of attacks from specific IP addresses and ASNs, pointing towards the potential utilization of botnets or centralized attack mechanisms. Notably, a significant percentage of entries were linked to the top 20 IP addresses, suggesting a concentrated nature of cyber threats. The findings indicate a need for increased vigilance even in environments perceived to be trustworthy, particularly considering the predominant utilization of reputable cloud services as attack vectors. Understanding the dispersion and concentration of attacks from individual source IPs informs the development of targeted defense mechanisms and fosters international collaboration to counter cybercrime effectively.
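The concentration of entries among the top 20 source IP addresses can be expressed as a single share. The sketch below is illustrative only: `top_n_share` is a hypothetical helper, and the addresses are taken from documentation-reserved ranges rather than from the honeypot log.

```python
from collections import Counter


def top_n_share(values, n=20):
    """Fraction of all entries attributable to the n most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return top / total


# Hypothetical source IPs, one per log entry.
toy_ips = ["198.51.100.1"] * 6 + ["198.51.100.2"] * 3 + ["203.0.113.9"] * 1
share = top_n_share(toy_ips, n=2)
print(f"Top-2 share: {share:.0%}")
```

The same helper could be applied to ASNs or countries, since it only depends on the frequency of each value.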
When cross-referenced with threat intelligence data repositories, the comprehensive analysis of source IP addresses revealed critical insights into the distribution of cyber threats. The study reaffirms the necessity of an exhaustive analysis of source IP addresses to comprehend cyber-attack patterns and develop effective threat detection and prevention strategies. By fostering international collaboration and sharing these insights, this approach contributes to the broader cybersecurity field's capacity to navigate the myriad of cybersecurity challenges.
4.3.7. Destination Ports Analysis
The study's findings suggest an increasing sophistication and targeted approach to cyber-attacks over time. The high prevalence of attacks on services like the VNC-Server (port 5900) that require more sophisticated attack vectors compared to standard ports such as HTTPS (443) or SSH (22) reinforces this observation. The data points to a high concentration of attacks from specific IP addresses and ASNs, implying the potential use of botnets or centralized attack mechanisms. The use of reputable cloud services to initiate attacks emphasizes the need for advanced security measures.
The "count_diff" data provides a dynamic perspective on the changes in network traffic over the years. Cyber-attacks have become more targeted and sophisticated, with changing preferences for specific ports across different years. The fact that ports such as 5900 and 8 show a marked increase in traffic points to shifting attacker strategies. Conversely, a decrease in traffic for port 22 may suggest changes in the targeted systems' security measures or network configurations. This could benefit future cybersecurity studies and equip network administrators with vital information to enhance network security measures. The study thus provides a critical understanding of the cyber threat landscape, emphasizing the importance of constant vigilance and adaptability in the face of evolving cyber threats.
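The text does not define how "count_diff" was calculated. One plausible reading, assumed here purely for illustration, is the change in the number of entries per destination port from one year to the next. The function `yearly_count_diff` and the toy records are hypothetical.

```python
from collections import defaultdict


def yearly_count_diff(records):
    """records: list of (year, port) pairs, one per log entry.

    Returns {port: {year: count_in_year - count_in_previous_year}}.
    Assumption: a year without entries for a port counts as zero.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for year, port in records:
        counts[port][year] += 1
    years = sorted({year for year, _ in records})
    diffs = {}
    for port, per_year in counts.items():
        diffs[port] = {
            cur: per_year.get(cur, 0) - per_year.get(prev, 0)
            for prev, cur in zip(years, years[1:])
        }
    return diffs


# Hypothetical toy records: port 22 declines while port 5900 grows.
toy_records = (
    [(2018, 22)] * 5 + [(2019, 22)] * 2
    + [(2018, 5900)] * 1 + [(2019, 5900)] * 4
)
diffs = yearly_count_diff(toy_records)
print(diffs)
```

A positive value indicates growing traffic to a port and a negative value indicates a decline, which matches the way the differences are interpreted in the discussion.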
4.3.8. Destination Services Analysis
The study's results suggested an increased focus on less known or difficult-to-categorize services, indicative of a rise in the complexity and sophistication of cyber-attacks. This finding aligns with the initial hypothesis. A consistent pattern of annual increases in attacks was noted for certain services such as 'ICMP-Echo-Request,' 'Unknown,' and 'VNC-Server.' In contrast, other services, such as 'bgp' and 'Domain-s,' were only recorded in specific years.
The "Cluster" column, introduced through a KMeans clustering algorithm, provided additional depth to the analysis. It grouped destination services into clusters based on similarity, revealing distinct patterns for services like 'Unknown,' 'VNC-Server,' 'ICMP-Echo-Request,' and 'ssh.'
The analysis of "Destination Services" and the incorporation of the "count_diff" data and KMeans clustering painted a comprehensive picture of the evolving nature and complexity of cyber-attacks. The analysis of destination IP services revealed a diverse range of targeted services. It demonstrated a marked increase in attacks on less known or harder-to-categorize services, indicative of an increase in the complexity and sophistication of cyber-attacks. These results are of immense value to network administrators and security professionals, providing vital insights for developing and reinforcing robust cybersecurity measures in response to the evolving threat landscape.
4.3.9. Autonomous System Numbers and Names Analysis
Despite the significant network activity linked to entities such as DigitalOcean, Amazon-AES, and Amazon-02, it is crucial to understand that these organizations' high entry numbers do not necessarily signify direct involvement in malicious activities. These numbers might reflect the large customer bases of these organizations, which could potentially include users exploiting these services for nefarious activities.
Temporal trends demonstrate the ever-changing nature of the cyber threat landscape. The fluctuations observed in specific ASNs over the years highlight the need for continuous monitoring and updating of cybersecurity measures to match the evolving nature of threats. Furthermore, the cluster analysis of ASNs offered more profound insights into the patterns of malicious network activity, indicating the changing landscape of cyber threats.
The analysis of ASNs revealed distinct patterns of network activity linked to malicious intent, with significant variations across different ASNs and years. The findings emphasized the critical role of robust cybersecurity measures and continuous cyber threat analysis in understanding and combating these evolving threats. By shedding light on the temporal behavior and clustering characteristics of ASNs, this analysis provides insights for future research in this area, thereby contributing to a broader understanding of cyber threats and strengthening the defenses against them.
4.3.10. Behavior Analysis
As a metric, the behavior score demonstrated its potential in discerning anomalous from expected network behavior. This approach leverages the inherent structure of the Internet, employing AS numbers and organizations as critical factors in behavior analysis.
In the context of cyber threat intelligence, these results highlight the significant role of behavioral patterns in network traffic analysis. Countries like the United States and Germany, through their AS numbers and organizations, exhibited higher behavior scores, signaling potential security threats. Notably, these countries are significant Internet nodes, reinforcing the necessity of vigilant cybersecurity measures in these regions.
However, it is essential to consider that a higher behavior score may not directly correspond to malicious intent. Network traffic can exhibit strange behavior for several reasons, such as configuration changes, software updates, or non-standard user behavior. Therefore, these results should be interpreted with caution and need to be corroborated with additional data or context.
This study sheds light on the potential of using behavior scores as an effective tool for anomaly detection in network traffic. The high behavior scores associated with specific AS numbers and organizations emphasize the need for rigorous and continuous monitoring of these entities. These findings, coupled with the distribution of behavior scores, offer valuable insights for cybersecurity practitioners in their ongoing efforts to detect, mitigate, and prevent cyber threats.
While the study offers promising results, future work should focus on refining the behavior score by incorporating more diverse factors. This will help in reducing false positives and enhancing the precision of the anomaly detection process. Also, further research is required to understand the reasons behind the elevated behavior scores observed for certain entities to facilitate more effective threat intelligence.
4.3.11. Clustering Analysis
The clustering analysis indicates shifts in the patterns of attacks over time, with periods of higher and lower attack counts. The clusters were formed to show how the attack counts evolved: cluster 2 contains the lowest attack counts, cluster 0 contains higher attack counts than cluster 2, and cluster 1 contains the highest attack counts.
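Grouping periods into low, medium, and high attack-count clusters can be illustrated with a one-dimensional k-means. This is a minimal sketch, not the study's implementation: `kmeans_1d` is a hypothetical helper, the monthly counts are invented, and the deterministic initialization is a simplification. Because the initial centroids are ascending, label 0 here denotes the lowest-count cluster; cluster labels are arbitrary identifiers, so they need not match the numbering reported in the text.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Lloyd's algorithm on scalar values; returns (labels, centroids)."""
    ordered = sorted(set(values))
    if k < 2 or len(ordered) < k:
        raise ValueError("need k >= 2 and at least k distinct values")
    # Deterministic start: spread centroids evenly over the distinct values.
    centroids = [
        float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)
    ]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(
                sum(members) / len(members) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical monthly attack counts.
monthly_counts = [10, 12, 11, 50, 55, 52, 200, 210]
labels, centroids = kmeans_1d(monthly_counts, k=3)
print(labels, centroids)
```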
The anomaly detected in October 2019, despite the general high attack counts during the period, signifies an unusual increase that deviated from the established pattern. This anomaly, marked by an exceptionally high attack count, underscores the need to understand and prepare for such extreme instances.
The clustering and anomaly detection analysis offers a robust method for understanding the patterns and shifts in cyber-attack counts over time. This understanding is vital in enhancing the preparedness and responsiveness of cybersecurity defenses to such threats. The distinction in attack counts represented by different clusters and the detection of anomalies offer valuable insights into the dynamic nature of the cyber threat landscape. The presence of outliers, such as the anomaly detected in October 2019, emphasizes the need for continuous monitoring and evaluation of the threat landscape to better anticipate and manage cybersecurity risks.
4.3.12. Anomaly Detection with Clustering
The integration of time series analysis and destination port clustering offers an in-depth perspective on the continually shifting cyber threat landscape. Persistent high-attack clusters, such as Cluster 2 in the early data and Cluster 1 in the latest data, denote sustained areas of vulnerability, indicating the need for bolstered cybersecurity measures.
Identifying anomalies, such as the spike in attack count in July 2017 and October 2019, underscores the need for dynamic and adaptable cybersecurity strategies capable of responding to abrupt shifts in attack patterns.
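The text does not state which anomaly detection rule flagged these spikes. As an illustration only, the sketch below substitutes a common robust technique, the modified z-score based on the median and the median absolute deviation (MAD), with the conventional cutoff of 3.5. The function name `mad_anomalies` and the counts are hypothetical.

```python
from statistics import median


def mad_anomalies(counts, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(counts)
    mad = median([abs(c - med) for c in counts])
    if mad == 0:
        # No spread to measure against, so nothing is flagged.
        return []
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    # under a normal distribution.
    return [
        i for i, c in enumerate(counts)
        if 0.6745 * abs(c - med) / mad > threshold
    ]


# Hypothetical monthly attack counts with one pronounced spike at index 5.
toy_counts = [100, 120, 90, 110, 105, 900, 95, 115]
flagged = mad_anomalies(toy_counts)
print(flagged)
```

A median-based rule is used in this sketch because a single extreme month inflates the mean and standard deviation, which can mask the very spike being sought.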
The correlation between the high attack clusters and specific destination ports (8, 587, 22, and 5900) implies that these ports may be targets or particularly vulnerable points in the network.
Identifying anomalies in the data is crucial for understanding sudden shifts or surges in cyber-attacks. These anomalies could be indicators of coordinated large-scale attacks, the discovery of a new vulnerability by attackers, or a change in attack techniques.
Moreover, these anomalies' correspondence with specific destination ports suggests a targeted approach by the attackers. For instance, ports 8, 587, 22, and 5900 are associated with high attack clusters during abnormal periods, implying that these ports might have been specifically targeted or were particularly vulnerable.
Incorporating clustering and anomaly detection with time series data is essential for understanding patterns, shifts, and abnormalities in cyber threats over time. Ports and times with high attack counts require priority in cybersecurity strategies. Identifying anomalies and evolving attack patterns underscores the need for continuous threat monitoring and adaptable response strategies in the face of a rapidly changing cyber threat landscape.
Anomaly detection plays a pivotal role in cyber threat analysis. Identifying and understanding anomalies allows early detection of significant threats and facilitates prompt and effective responses. The abnormalities identified in this analysis highlight the importance of continuous monitoring and adaptive security measures to handle sudden shifts in attack patterns. The association of certain anomalies with specific destination ports provides valuable insights into potential vulnerabilities or targeted attack points. Therefore, anomaly detection and port clustering serve as effective instruments for a thorough understanding of the cyber threat landscape.
4.4. Comparison to Previous Research
The results of the present study align with the existing literature on network anomaly detection while also providing unique insights. Consistent with previous research, the study confirms the significance of machine learning in detecting anomalies in network traffic [8, 9, 11, 12, 14, 19, 21]. However, it extends this premise by focusing on network behavior anomalies characterized by unusual network patterns potentially indicative of cyber threats.
The behavioral scoring system used in this study aligns with the approach of Alsarhan [8], Boateng [21], and Mengidis et al. [12], who utilized machine learning methodologies for anomaly detection. It diverges, however, by tying the scoring to a combination of Autonomous System Numbers and Names (ASNs), the country of origin, and the number of connections made. This multifaceted approach to scoring contributes to a more holistic view of network behavior.
The assertion made in the current study about the significance of IP address and ASN in identifying anomalous network activities finds support in the work of Alowaisheq [17] and Li [6]. Yet, the current research extends this understanding by providing quantifiable evidence through a behavior-scoring mechanism that connects these factors with the frequency and nature of abnormal behavior, a contribution not previously articulated in such detail.
In previous studies, Aboah Boateng [21] and Mengidis et al. [12] incorporated unsupervised machine learning methods to identify anomalies in process control systems and host-based intrusion, respectively. This study utilizes a similar unsupervised machine learning approach but applies it to an entirely network-centric dataset.
Similar to the work of Fu et al. [20] that applied a reduction method to intrusion detection data, this study also emphasizes the need for data reduction and dimensionality reduction techniques. However, the research leverages both source IP addresses and ASNs to perform the reduction process, enhancing the efficiency of anomaly detection.
Research by Moriano Salazar [23] has pointed out the significance of analyzing real-world temporal networks. This study echoes this sentiment, emphasizing the necessity of continuous and real-time monitoring of network behavior due to the dynamic nature of cyber threats.
This study aligns with the work of Alowaisheq [17], who examined security traffic from different perspectives: defenders, attackers, and bystanders. Similarly, this research analyzed network behavior from multiple angles, considering the origin of traffic and the behavior associated with that origin.
Moreover, the current study's emphasis on continuous monitoring and updating of models as cyber threats evolve [11, 23] adds to the narrative espoused by Moriano Salazar [23] and Ongun [11] concerning the temporally dynamic nature of network behavior. However, the research goes a step further by integrating this concept into a framework for practical implementation, thereby offering actionable insights for the cybersecurity community.
The present study, however, diverges from previous research in its emphasis on a behavior-based scoring system tied to ASNs and the country of origin. While Chatterjee [18] employed deep learning mechanisms for network intrusion detection, using a behavior-based scoring system provides a unique and potentially more accessible approach to identifying and assessing the severity of anomalies.
In summary, the current study expands the knowledge in network anomaly detection, building upon the foundation established by prior research while providing new insights through a unique behavior-based scoring system. As with all research, these findings should be viewed as a point of departure for future studies, continually refining and enhancing the understanding of network behavior anomalies.