Actionable Knowledge Discovery for Infrastructure Reliability via Analytics and DevOps Automation

Mr. Satbir Singh

Authors

Mr. Satbir Singh, Security & DevOps Strategist, Independent Researcher, CA, USA

Abstract

In the era of distributed cloud-native infrastructures, ensuring high reliability and rapid recovery from failures has become a critical operational challenge. Traditional monitoring systems, which rely on threshold-based alerts and manual inspection of logs, are increasingly inadequate for handling the volume, velocity, and variety of observability data. This research proposes a modular and interpretable framework that integrates advanced data analytics, unsupervised anomaly detection, log pattern mining, and intelligent inference to discover actionable knowledge from large-scale system telemetry. By leveraging open datasets (e.g., Alibaba and Google traces) and combining metrics with structured logs, the system enables proactive failure detection and root cause analysis across services. The inferred insights are tightly coupled with DevOps automation workflows, such as CI/CD rollbacks and container restarts, to reduce Mean Time to Detection (MTTD) and Mean Time to Recovery (MTTR). Experimental evaluations demonstrate up to 60% improvement in detection times and 45% improvement in recovery durations compared to traditional setups. The architecture is designed for extensibility, human oversight, and deployment in real-world hybrid environments. This work advances infrastructure resilience by bridging the gap between observability and intelligent, policy-aware automation.
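As a rough illustration of how anomaly detection on metrics might feed a policy-aware automation step, the sketch below flags outliers with a trailing-window z-score and maps each flagged metric to a remediation action. It is not the framework described in the abstract: the z-score detector stands in for whichever unsupervised method the paper uses, and the function names, the `POLICY` table, the metric names, and the action labels are all hypothetical.

```python
from statistics import mean, stdev

def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` samples by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical policy table (an assumption for illustration):
# metric name -> remediation action label.
POLICY = {
    "latency_ms": "rollback_last_deploy",
    "memory_mb": "restart_container",
}

def plan_actions(metrics):
    """Map each detected anomaly to an action. Metrics with no policy entry
    fall back to paging a human, which keeps an operator in the loop."""
    actions = []
    for name, series in metrics.items():
        for idx in detect_anomalies(series):
            actions.append((name, idx, POLICY.get(name, "page_on_call")))
    return actions

if __name__ == "__main__":
    sample = {"latency_ms": [100, 102, 99, 101, 100, 103, 400, 101]}
    for metric, index, action in plan_actions(sample):
        print(f"{metric}[{index}] anomalous -> {action}")
```

In this sketch the planner only returns action labels; actually executing a rollback or restart would be left to the CI/CD or orchestration tooling, ideally behind an approval gate.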
