The growing complexity of multicloud data center infrastructures, together with the rapid growth of telemetry streams, event logs, and protocol notifications, makes timely incident detection and fault localization a critical problem for modern telecommunication systems. Conventional monitoring approaches rely mainly on the analysis of large historical logs, which limits their ability to identify short-lived correlated failures and to support rapid operational decision-making. The aim of this paper is to develop a software-oriented architecture for intelligent monitoring of data center network infrastructure based on localized telemetry buffering, centralized event correlation, and AI-assisted diagnostics. The proposed method combines four elements: local short-term telemetry snapshots formed on network and server nodes; clustering of critical events within a limited time window; root cause analysis that correlates protocol, resource, and trap data; and risk estimation with a machine learning model. The outcome is an incident-processing algorithm and a five-layer monitoring architecture comprising local agents, temporary telemetry buffers, a central correlation node, a failure knowledge base, and an AI assistant for troubleshooting support. The results show that the proposed approach reduces the amount of data requiring urgent processing, improves the identification of causally related events, and decreases the time needed to interpret critical incidents. The practical relevance of the study lies in the possibility of deploying the system as a software layer over existing monitoring and logging infrastructure without hardware modification. The theoretical relevance lies in the formalization of localized incident-context modeling, critical-event clustering, and AI-assisted root cause identification for multicloud data center environments.
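The clustering of critical events within a limited time window can be illustrated with a minimal sketch. It is not the implementation developed in the paper: the `Event` fields, the function names `cluster_by_window` and `earliest_event`, the 5-second window, and the sample events are all hypothetical, and treating the earliest event in a cluster as a root cause candidate is only a simple heuristic assumed here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """A critical event reported by a local agent (hypothetical schema)."""
    timestamp: float  # seconds, e.g. Unix time
    node: str         # reporting network or server node
    kind: str         # "protocol", "resource", or "trap"
    message: str


def cluster_by_window(events: List[Event], window_s: float = 5.0) -> List[List[Event]]:
    """Group events so that consecutive events in a cluster are at most
    window_s seconds apart. The window length is an assumed parameter."""
    clusters: List[List[Event]] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if clusters and ev.timestamp - clusters[-1][-1].timestamp <= window_s:
            clusters[-1].append(ev)   # close enough to the previous event
        else:
            clusters.append([ev])     # gap exceeded: start a new cluster
    return clusters


def earliest_event(cluster: List[Event]) -> Event:
    """Return the earliest event in a cluster, used here as a naive
    root cause candidate (a heuristic, not the paper's method)."""
    return min(cluster, key=lambda e: e.timestamp)


# Illustrative input, deliberately unsorted.
events = [
    Event(102.0, "srv-7", "resource", "NIC errors rising"),
    Event(100.0, "leaf-1", "protocol", "BGP session down"),
    Event(300.0, "spine-2", "resource", "CPU high"),
    Event(101.5, "leaf-1", "trap", "linkDown"),
]

clusters = cluster_by_window(events, window_s=5.0)
for cluster in clusters:
    candidate = earliest_event(cluster)
    print(len(cluster), candidate.node, candidate.message)
```

In this sketch a cluster grows as long as each new event arrives within the window of the previous one, so a burst of related protocol, resource, and trap events is kept together while an unrelated event minutes later forms its own cluster. A real correlation node would likely also use topology and the failure knowledge base rather than timestamps alone.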