LLM-Enhanced Intelligent Fault Diagnosis and Self-Healing Framework for Cloud Computing Systems
Existing fault detection methods for cloud and quantum systems are powerful but brittle: they struggle with previously unseen failures, rely on inflexible recovery playbooks, and use fixed quantum error correction (QEC) schemes, all of which are significant limitations in heterogeneous multi-cloud settings. To overcome these issues, we introduce Intelligent Multi-Cloud Fault Detection with Adaptive Quantum Error Correction. Our framework is built on three pillars: hierarchical multi-agent learning, adaptive multi-cloud execution, and predictive QEC. Specialized agents learn from experience, while the system adapts to real-time cloud performance and quantum error states. We evaluate the framework on the CloudSim Fault Injection Dataset, the Multi-Cloud Performance Benchmark, and IBM Quantum Error Logs. It achieves 94.2% detection accuracy while reducing false positives by 68%. System availability rises from 85% to 96.1%, and recovery time falls from 340 s to 45 s. For quantum workloads, the framework reaches a 96.7% success rate with 94.3% state fidelity. This work offers a more robust and adaptive solution for fault management in today’s complex hybrid cloud-quantum environments.