Paper Title
ACT now: Aggregate Comparison of Traces for Incident Localization
Paper Authors
Paper Abstract
Incidents in production systems are common and downtime is expensive. Applying an appropriate mitigating action quickly, such as changing a specific firewall rule, reverting a change, or diverting traffic to a different availability zone, saves money. Incident localization is time-consuming since a single failure can have many effects, extending far from the site of failure. Knowing how different system events relate to each other is necessary to quickly identify \emph{where} to mitigate. Our approach, Aggregate Comparison of Traces (ACT), localizes incidents by comparing sets of traces (which capture events and their relationships for individual requests) sampled during the most recent steady-state operation and during an incident. In our quantitative experiments, we show that ACT is able to effectively localize more than 99% of incidents.