[
23] The rapidly developing subject of AI for IT operations, or AIOps, combines AI and Machine Learning (ML) with IT operations to optimise and automate key processes related to resource management. [
22] An increasing number of Cloud Service Providers (CSPs) and tech giants such as Microsoft and Baidu are leading the charge in developing comprehensive AIOps solutions as cloud computing continues to dominate the IT landscape. With its Azure-based AIOps solution, Microsoft has developed a complete strategy that addresses four essential areas: execution, analysis, decision-making, and monitoring. This all-encompassing approach aims to keep IT operations running smoothly and continuously, with minimal downtime and better system response. Likewise, Baidu's AIOps solution concentrates on anomaly detection, traffic planning, pattern forecasting, and root cause analysis. The literature also explores additional frameworks and techniques for improving fault tolerance in cloud systems. Fault-tolerance techniques are thoroughly reviewed by [
9], who emphasise that cloud systems must be able to adapt and learn from errors. Their research divides fault detection and recovery techniques into three categories: reactive, proactive, and resilient. Reactive techniques, such as restarting the system, are often used; however, the most important way to improve fault tolerance is through resilient techniques that anticipate errors and interact intelligently with the cloud system. [
10] were among the first to recognise the significance of self-diagnosis and self-healing in cloud systems. [
11] have made several significant contributions to the subject: their proposed hybrid tool manages and recovers from system anomalies by combining multivariate decision-making with the Naive Bayes classifier. In a study on the effects of virtualization in cloud systems, [
12] highlighted the importance of virtualization for efficient load balancing. To ensure data availability and integrity, their research uses data replication strategies. [
13] added to the AIOps landscape by proposing a self-healing framework that dynamically assesses cloud service performance and creates recovery plans accordingly. To minimise interruption, this preventive approach either implements recovery plans right away or preserves them for later use. A self-healing framework that dynamically assesses cloud service performance and creates recovery plans, in accordance with the findings of [
14], further improved the AIOps landscape. The main goal of this preventive strategy is to apply recovery plans either immediately or after storing them for later use, so that disturbance is kept to a minimum. [
15] presented a framework designed specifically for cloud-based web applications. The framework is intended to let users identify workload and performance anomalies and report them to CSPs so that appropriate modifications can be made. The contributions to AIOps from academia and industry are numerous and diverse. For instance, [
16] proposed an autonomic cloud resource management technique that places a strong emphasis on reliable and reasonably priced cloud services. It prioritises self-healing by identifying defects and implementing the necessary corrective measures, and it continuously assesses service quality against predetermined SLA parameters. Additionally, the practical use of AIOps is demonstrated by solutions such as Kubernetes and Netflix's Winston. Winston is an event-driven automation platform that allows for automated issue diagnosis and correction. Kubernetes, which is well known for managing clusters of containers, leverages container self-healing to preserve system health without compromising the overall state.
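
The self-healing pattern that recurs in the works surveyed above (assess service quality against SLA parameters, derive a recovery plan, then either apply it immediately or store it for later use) can be illustrated with a minimal Python sketch. It does not reproduce any of the cited frameworks: the thresholds, the class names `RecoveryPlan` and `SelfHealer`, and the restart/scale-out policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical SLA thresholds; metric names and values are illustrative assumptions.
SLA = {"latency_ms": 200.0, "error_rate": 0.01}


@dataclass
class RecoveryPlan:
    service: str
    violations: dict  # metric name -> observed value that breached the SLA
    action: str       # corrective measure chosen by the assumed policy


@dataclass
class SelfHealer:
    stored_plans: list = field(default_factory=list)  # plans kept for later use
    executed: list = field(default_factory=list)      # plans applied immediately

    def assess(self, metrics):
        """Return the metrics that exceed their SLA threshold."""
        return {m: v for m, v in metrics.items() if m in SLA and v > SLA[m]}

    def plan(self, service, violations):
        # Assumed policy: an error-rate breach triggers a restart,
        # a latency-only breach triggers a scale-out.
        action = "restart" if "error_rate" in violations else "scale_out"
        return RecoveryPlan(service, violations, action)

    def handle(self, service, metrics, apply_now=True):
        """Assess a service; record a recovery plan if the SLA is violated."""
        violations = self.assess(metrics)
        if not violations:
            return None  # service is healthy, nothing to do
        plan = self.plan(service, violations)
        (self.executed if apply_now else self.stored_plans).append(plan)
        return plan
```

In this sketch, calling `handle` with `apply_now=False` corresponds to preserving a plan for later use, while the default records it as applied right away; a real system would replace the list bookkeeping with actual remediation actions.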