Paper Title

Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition

Authors

Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei

Abstract

Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In online service systems, operators rely on large volumes of monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save much of the time spent on failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, i.e., a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). For application to online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. Performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
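The core idea in the abstract, a change in a metric's distribution conditioned on its parents in the CBN, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical three-metric chain (traffic → latency → errors) in which each metric has at most one parent, a linear relation between parent and child, and a score equal to the largest standardized residual during the fault window. The names `fit_line`, `residuals`, `score_nodes`, and `parent_of`, and all data values, are made up for illustration.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of ys ~ a + b * xs; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return my - b * mx, b

def residuals(data, node, parent, model):
    """Observed value minus the value predicted from the parent (if any)."""
    a, b = model
    if parent is None:
        return [y - a for y in data[node]]
    return [y - (a + b * x) for x, y in zip(data[parent], data[node])]

def score_nodes(parent_of, normal, faulty):
    """Score each metric by how far its fault-window values deviate from the
    parent-conditioned model learned on fault-free data (assumed scoring rule)."""
    scores = {}
    for node, parent in parent_of.items():
        if parent is None:
            model = (mean(normal[node]), 0.0)
        else:
            model = fit_line(normal[parent], normal[node])
        base = residuals(normal, node, parent, model)
        mu, sigma = mean(base), pstdev(base) or 1e-9
        scores[node] = max(abs(r - mu) / sigma
                           for r in residuals(faulty, node, parent, model))
    return scores

# Hypothetical causal chain: traffic -> latency -> errors.
parent_of = {"traffic": None, "latency": "traffic", "errors": "latency"}

# Fault-free window (illustrative numbers only).
normal = {
    "traffic": [10, 12, 11, 13, 9, 10, 12, 11],
    "latency": [20.5, 23.5, 22.3, 25.7, 18.2, 19.8, 24.4, 21.6],
    "errors":  [2.10, 2.30, 2.18, 2.62, 1.87, 1.93, 2.49, 2.11],
}

# Fault window: latency jumps although traffic is unchanged; errors rise too,
# but only as much as the higher latency already explains.
faulty = {
    "traffic": [10, 11, 12],
    "latency": [40.0, 42.0, 44.0],
    "errors":  [4.05, 4.15, 4.40],
}

scores = score_nodes(parent_of, normal, faulty)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy data the error count roughly doubles during the fault, so a purely marginal anomaly detector would flag both latency and errors. Conditioning on the parent removes the part of the error increase that latency explains, which leaves latency as the only metric whose conditional distribution has clearly shifted.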
