Actionable Knowledge Discovery for Infrastructure Reliability via Analytics and DevOps Automation
Abstract
In distributed cloud-native infrastructures, ensuring high reliability and rapid recovery from failures has become a critical operational challenge. Traditional monitoring systems, which rely on threshold-based alerts and manual inspection of logs, are increasingly inadequate for handling the volume, velocity, and variety of observability data. This research proposes a modular and interpretable framework that integrates data analytics, unsupervised anomaly detection, log pattern mining, and intelligent inference to discover actionable knowledge from large-scale system telemetry. By leveraging open datasets (e.g., Alibaba and Google traces) and combining metrics with structured logs, the system enables proactive failure detection and root cause analysis across services. The inferred insights are tightly coupled with DevOps automation workflows, such as CI/CD rollbacks and container restarts, to reduce Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR). Experimental evaluations demonstrate up to 60% improvement in detection times and 45% improvement in recovery durations compared to traditional setups. The architecture is designed for extensibility, human oversight, and deployment in real-world hybrid environments. This work advances infrastructure resilience by bridging the gap between observability and intelligent, policy-aware automation.
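As a rough illustration of the detect-then-act loop the abstract describes, the sketch below flags metric anomalies with a trailing-window z-score and maps them to a remediation step. It is not the authors' implementation: the z-score detector stands in for whatever unsupervised method the paper uses, and the function names, the window and threshold values, and the policy rules in `choose_action` are all hypothetical.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0, min_sigma=1e-6):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations (illustrative detector)."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        # Floor the deviation so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), min_sigma)
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


def choose_action(metric_name, log_lines):
    """Hypothetical policy: pick a remediation from the metric and log evidence."""
    error_count = sum("ERROR" in line for line in log_lines)
    mentions_deploy = any("deploy" in line.lower() for line in log_lines)
    if metric_name == "error_rate" and mentions_deploy:
        return "rollback"           # errors right after a deploy: undo the release
    if error_count >= 2:
        return "restart_container"  # repeated errors without a deploy in the logs
    return "notify_operator"        # default: keep a human in the loop


# Assumed sample data: CPU utilisation (%) with one spike at index 6.
cpu = [50, 52, 51, 49, 50, 51, 95, 50]
for idx in detect_anomalies(cpu):
    print(idx, choose_action("cpu", ["ERROR oom", "ERROR oom"]))
```

Defaulting to `notify_operator` mirrors the abstract's emphasis on human oversight: an automated action fires only when the evidence matches an explicit rule.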
License
Copyright (c) 2024 International Journal of Business Management and Visuals, ISSN: 3006-2705

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.